From YouTube: Healthcare Data Science in Clojure - Scicloj meeting 15
Description
This was our first in a series of public meetings about Clojure and data science in healthcare and medicine.
In this meeting, the main theme was knowledge management.
* Sivaram Arabandi: "Biomedical Ontologies - Design Patterns and Applications"
* Pier Federico Gherardini: "CANDEL: A platform for biological data science using Clojure, R, and Datomic"
* Discussion
Moderator: João Santiago
The text conversation was quite active during this meeting. You may find it useful to read through the text chat: https://tinyurl.com/y4ks7f6o
Clojurians Zulip discussion:
https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/healthcare.20meeting.20.231.3A.20knowledge.20management
A: My name is João Santiago, or, like I said, just Santiago for short. I am a medical doctor currently working as a data scientist in Berlin, Germany, and I'll be moderating the meeting today. Before we start with the actual presentations, I'd like to give everyone a couple of guidelines to make the meeting as pleasant as possible for everyone. Please mute yourself when you're not speaking. You can post, and please do post, any questions at any time in the chat. I will be collecting the questions throughout the meeting, and after each speaker's presentation I will pick one or two questions to present to the speakers. Then, at the end, we will go over the ones that were not selected, during the open discussion.

So, without further ado, I would give the stage to you, Sivaram, to talk about biomedical ontologies.
B: Thank you, Santiago, and thank you, Daniel, for setting this up. It's a pleasure to be speaking here at the Clojure-for-science and Scicloj data science group, and today I will be talking about biomedical ontologies: some design patterns and applications. I'll be going over a little bit of ontology basics, because from some of the previous conversations it looked like most people are not familiar with ontologies, so bear with me.
Okay, let's start by taking a look at what a clinical case looks like. Here's an example: a 60-year-old male patient presented with a sudden onset of chest pain, nausea, sweating, and dyspnea. Angiography was performed, which demonstrated a left anterior descending artery stenosis of more than 90%, and the patient was stented. So this is a kind of typical note you would see. Now, if you annotate this, or classify it into certain kinds of things, we have different kinds of data elements here: 60 years, male, chest pain, nausea, sweating, dyspnea, mitral valve stenosis, and so on. Now, if we categorize these a little further, this is what they look like.
B: So we have some demographic kinds of information, like age and gender; some symptoms and signs; lab information; diagnosis; and some management aspects. And these come in different kinds of data types. There are continuous types of data, values that have measurement units and a range, like age, weight, troponin, or FEV1, which is a respiratory volume measurement. And then, on the other hand, we have categorical types of values, like sex, diagnosis, medications, etc.
B: Now, boolean is a special type of categorical, where you have a true or false value, for things like: does the patient have a past history of stroke? Is there a history of coughing? You're going to say true or false there. So the scope of all this is that we have a variety of types of information. We're looking at clinical care data (symptoms, signs, diagnoses, labs, etc.), information from research (registries, studies, clinical trials, etc.),
B
Some
demographic
information
and
some
genomics
information.
So
this
presents
us
with
a
number
of
challenges.
Data
challenges
right.
The
good
news
is
that
it's
there's
a
lot
of
data
available
for
us.
The
bad
news
is
that
the
data
is
it's
all
segmented,
it's
all
fragmented
in
silos
and
in
addition,
there
are
because
we
are
dealing
with
a
number
of
different
data
sources.
B
We
we
have
structural
differences
right
because
of
the
localized
database,
schemas
that
are
used
and
even
within
that,
how
well
structured
is
it
right
and
you
might
get
some
text
data
as
well.
So
it
is
an
integration
challenge
for
us
for
sure,
but
that's
only
the
beginning,
because
we
then
have
to
we're
also
faced
with
the
the
challenges
with
the
semantics,
the
meaning
of
the
data.
So
we
have
we'll
be
having
different
labels
that
mean
the
same
thing
in
different
databases.
B
For
example,
if
you
take
male
gender
right,
it
could
be
recorded
as
as
man
it
could
be
as
male
as
m
as
one
as
zero.
So
there
are
different
variations
in
how
this
can
be
recorded
or
for
that
matter,
even
within
us
within
a
single
table.
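The label-harmonization problem described here can be sketched in a few lines. This is a minimal illustration, not any real project's code; the source names and mappings are hypothetical, and the point is that the decoding has to be per source, since the same raw code can mean different things in different schemas.

```python
# Each source database needs its own mapping, because a code like "1"
# may mean male in one schema and something else in another.
# All source names and codings below are made-up examples.
SOURCE_MAPPINGS = {
    "registry_a": {"man": "male", "woman": "female"},
    "registry_b": {"m": "male", "f": "female"},
    "trial_c":    {"1": "male", "0": "female"},   # numeric coding
}

def normalize_sex(source, raw):
    """Map a raw sex/gender value from a given source to a canonical label."""
    mapping = SOURCE_MAPPINGS.get(source, {})
    return mapping.get(str(raw).strip().lower(), "unknown")

print(normalize_sex("trial_c", 1))       # male
print(normalize_sex("registry_b", "F"))  # female
```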
B: You can come across variables like beta1 and beta2, and to a clinician, if you're talking about beta1 and beta2 blockers, you're thinking about beta-1-selective blockers and beta-2-selective blockers, two different types of medications. But in one of the use cases we had seen, beta1 and beta2 both represented beta blockers without a diuretic, and they actually represented different visits.
B: The first visit was called beta1 and the second visit was called beta2, so that was a complete curveball for us when we found out this information from one of the SMEs on the project. So now we have that the same label can be used with different meanings. Previously we looked at different labels that could mean the same thing, but now it's the same label with different meanings.
B: Take the word "cold": what does it mean to you? Well, we have at least three different meanings. One is, you could be feeling cold; it could be a cold infection, which is an infectious disease; or it could be chronic obstructive lung disease, which is another synonym for chronic obstructive pulmonary disease, or COPD. And then we have things like hypopnea, which is breathing that is very shallow or has an abnormally low respiratory rate.
B: So the English definition looks very simple, but when we look at the operational definitions, the American Academy of Sleep Medicine has two definitions. There is a recommended definition, which says: airflow reduction greater than 30% of baseline, lasting for 10 seconds, and a hemoglobin oxygen desaturation of greater than or equal to 4% from baseline. Now, this is one of the definitions.
B: There's an alternate definition: instead of 30%, they say greater than or equal to 50%, again for 10 seconds, but with a hemoglobin desaturation of greater than or equal to 3% instead of 4%. And from one of the projects that I worked on, we know that there are at least 10 other known definitions of this. So the challenges for us, when it comes to data, can be summarized this way: we have to deal with simple as well as complex definitions.
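The two AASM operational definitions just quoted can be encoded as predicates over a single scored breathing event; the sketch below is only an illustration (the field names are made up, not from any real scoring system), but it shows concretely how the same event can count as hypopnea under one definition and not the other.

```python
# The two AASM operational definitions of hypopnea quoted above,
# encoded as predicates. Field names are illustrative only.
def hypopnea_recommended(flow_reduction_pct, duration_s, desat_pct):
    """Airflow reduction > 30% of baseline, lasting >= 10 s,
    with hemoglobin O2 desaturation >= 4% from baseline."""
    return flow_reduction_pct > 30 and duration_s >= 10 and desat_pct >= 4

def hypopnea_alternate(flow_reduction_pct, duration_s, desat_pct):
    """Airflow reduction >= 50% of baseline, >= 10 s, desaturation >= 3%."""
    return flow_reduction_pct >= 50 and duration_s >= 10 and desat_pct >= 3

# The same event satisfies one definition but not the other:
event = dict(flow_reduction_pct=35, duration_s=12, desat_pct=4)
print(hypopnea_recommended(**event))  # True
print(hypopnea_alternate(**event))    # False
```

This is exactly the overlapping-definitions problem the talk describes: two cohorts scored with different definitions are not directly comparable.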
B: We need to deal with overlapping definitions, like we saw with the hypopnea operational definitions. We need to deal with data integration challenges, both across applications and across domains. We need to be able to handle querying these different data sources, and also make the data suitable for discovery. Now, this is exactly the situation where ontologies have proven to be extremely useful.
B: So in this presentation I'm going to talk about what an ontology is, give a small demo of some of the benefits of using an ontology, and we'll also look at some ontology patterns and uses. Okay, so let's start with what an ontology is, again going back to this definition: an ontology is a formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts.
B: When you're thinking of concepts, concepts have to do with reality, with domains within that reality. This is a semiotic triangle; some of you are probably familiar with it. This is the real world around us, and we refer to the real world using different terms, like "city" or "capital" or diseases, and things like that.
B: Now, when we communicate these words to other people, they think about them, and a word evokes a certain concept, which is about the world around us. So this is the semiotic triangle: how we interpret reality, how we talk about reality, and how we conceptualize reality.
B: Unfortunately, reality is not so easy to comprehend; there's often a lot that gets lost in translation. So here is a little bit of programmer humor, where a customer gave the requirements for a swing, and this is how we explained it, and you can see the transitions showing how it was finally interpreted. Interestingly enough, the last panel shows what the customer actually wanted, versus how we explained it.
B: So, for example, before the 16th century it was common knowledge that the Earth was the center, and there were pretty complex maps to track how the different planets moved and all that. Copernicus, in the 16th century, presented a different view, the heliocentric or sun-centric model, and that changed our perception of reality, of how we thought about Earth and the other planets.
B: Now, interestingly, Copernicus was very reluctant to present this view, because it was so different that he was afraid people would think he was crazy, but it was a good thing that it came out eventually. Now, going back to how we model information, or how we capture data with regard to the reality around us: we saw this description. A 60-year-old male patient presented with sudden onset of chest pain, nausea, sweating, and dyspnea.
B: So this is a very simple model. It is basically a list of terms that are there in our database, in the data model. There's no additional meaning given around chest pain, epigastric pain, cough, and dyspnea. For the most part, this is how it looks. Now, there may be definitions, but oftentimes it depends on who is seeing this list and how it is integrated.
B: Now, an ontology model, on the other hand, is a very rich network of relationships; it's a web of relationships. If you take the same information, or similar information, what I'm showing here is the anatomical aspects and the concepts related to pain.
B: Pain is subclassed into abdominal pain and chest pain, so abdominal pain is a type of pain and chest pain is a type of pain. Now, abdominal pain itself is subclassed as epigastric pain; there are other regions we could put here, but this is the only one I'm showing. Similarly, chest pain has ischemic chest pain, pleuritic chest pain, and sternal chest pain, depending on what area and what type of pain it is.
B: Now, epigastric pain is a type of abdominal pain and has site epigastrium, just as abdominal pain is a type of pain and has site abdomen. So you see the part-of structure here and the is-a structure here; they go together. Similarly, over here, chest pain has site chest; sternal pain is a type of chest pain and has site sternum; and the sternum is a part of the chest.
B: So if you connect these two things, you'll see how epigastric pain is connected in the ontology model. Now, the good thing about an ontology model is that this graph representation is actually pretty easily understood, and you can actually tell a story from it. And in addition to just the graph representation, each term has a textual definition to tell us what it means.
B: Controlled vocabularies are kinds of lists, which bring together some specific terms which you want to deal with together. A taxonomy is a little bit more structured, in that you have a hierarchical representation of the terms.
B: So in this case, as an example, cardiovascular system disease has two types: heart disease and vascular disease. And then again, heart disease can be of different types, mitral stenosis, aortic regurgitation, and so on: hypertension and peripheral artery disease are vascular diseases, while mitral stenosis and aortic regurgitation are heart diseases.
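The disease taxonomy just described is a plain hierarchy, and a subsumption query over it can be sketched in a few lines. This is only an illustration of the is-a closure idea, using the example terms from the talk.

```python
# The disease taxonomy described above, as parent -> children edges.
TAXONOMY = {
    "cardiovascular system disease": ["heart disease", "vascular disease"],
    "heart disease": ["mitral stenosis", "aortic regurgitation"],
    "vascular disease": ["hypertension", "peripheral artery disease"],
}

def descendants(term):
    """All terms transitively subsumed by `term` (the is-a closure)."""
    found = []
    for child in TAXONOMY.get(term, []):
        found.append(child)
        found.extend(descendants(child))
    return found

# Querying the root subsumes every specific disease below it:
print(descendants("cardiovascular system disease"))
```

This is what makes hierarchical querying work: asking for "heart disease" patients automatically includes the mitral stenosis and aortic regurgitation cases.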
B: So you have a taxonomy here which represents a hierarchical structure of diseases. The most complex of these is the ontological structure, where ontologies represent not just the taxonomic structure, which is in the form of the is-a relationship here, but also other relationships, like the part-of relationship here for anatomical structures, and other relations like has-site: abdominal pain has site abdomen, epigastric pain has site epigastrium, things like that.
B: So this is a rich web of relationships that we can use with our information model. Switching gears a little bit, I want to talk a little bit about the place of ontology, how it fits into the bigger picture. Ontology is one of the specifications that is part of the semantic web stack.
B: If you look at this middle structure in white, you can see that at the bottom is RDF, which is the foundational layer for the semantic web; then you have RDFS, which is the RDF schema language; then you have OWL, which is more expressive and allows us to specify some rules; and then there is the SPARQL language, which is basically the query language used for the semantic web.
B: Looking at the first one, the foundational one: the Resource Description Framework. The basic, fundamental part of the Resource Description Framework is the subject-predicate-object data structure. Every resource in the semantic web is described using this subject-predicate-object data structure. Basically, these are statements, and these triples are the fundamental structure of RDF.
B: So here, in this case, we are talking about Bob, and there are two relations, "lives in" and "knows", and then there is Houston, which is related to Bob via the "lives in" predicate. We're also talking about different kinds of concepts here, and if you look at each one of these entities, the subjects and the predicates and all that, they are identified using a Uniform Resource Identifier, or URI. They uniquely identify entities, and when it comes to URLs, these can often be resolved.
B: So, for example, Houston can be represented as dbpedia.org/page/Houston, and if you go to this website, you can see that it resolves into a description of what Houston is. Beyond that, RDF also provides an ability for us to tell what kind of thing Bob or Lisa, or basically any entity, is. So we can represent that Bob is of type Person, Lisa is of type Person, and Houston is of type City, and these types also get their own URIs.
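The Bob/Lisa/Houston statements above can be written down directly as subject-predicate-object triples, with a tiny pattern matcher standing in for a real triple store. In actual RDF each name would be a URI rather than a bare string; this sketch just shows the data shape and how wildcard queries over it work.

```python
# The statements from the example as subject-predicate-object triples.
triples = {
    ("Bob", "livesIn", "Houston"),
    ("Bob", "knows", "Lisa"),
    ("Bob", "type", "Person"),
    ("Lisa", "type", "Person"),
    ("Houston", "type", "City"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

print(match(p="type", o="Person"))   # who is a Person?
print(match(s="Bob"))                # everything stated about Bob
```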
B: Just to summarize this part of it: the semantic web has different parts. RDF is a standard model for data interchange, and triples form the fundamental data structure of RDF. RDF Schema, or RDFS, is the schema language that allows us to describe the instance data that we represent in RDF. So it allows us to define classes and subclass structures, like "a person is a subclass of organism",
B: or things like "reading is a subclass of activity". The Web Ontology Language, or OWL, provides us additional functionality, for describing rules and things like that, for authoring ontologies. And finally, SPARQL is the query language that we use for querying RDF data. Lastly, serialization formats: RDF, RDFS, and OWL data can all be serialized into different formats, like N-Triples, Turtle, RDF/XML, or even JSON-LD.
B: So these are all text formats, and the important thing to remember is that despite the different formats that an ontology or RDF data is serialized into, they all mean the same thing.
B: So what I did here is: I have a small ontology file that I've created for demonstrating some of these different aspects, and I'm going to show two different things, one at the class level and the other at the instance level. If you look at the things represented at the top level: body structures, clinical findings, drug products, geolocations, substances, and a value partition. The body structure part is pretty simple.
B
It's
a
straightforward
data
structure,
so
we
have
abdomen
chest
epigastrium,
pleura
and
external,
and
if
you
look
at
the,
if
you
look
at
the
definition
for
us,
for
example,
for
epigastrium,
we
say
it's
a
part
of
abdomen
and
sternum
is
a
part
of
the
chest.
B: That's what we've defined here. And similarly, in terms of clinical findings, I've defined pain and some subclasses of pain: abdominal pain, chest pain, epigastric pain, ischemic pain, pleuritic pain, and sternal pain. So if you look at sternal pain, you can see the definition: it's a subclass of pain and has site sternum.
B: So we have here different things: I've represented city and capital city, where capital city is defined as the capital of some country or state. I've defined continent, and this is a defined class, so you define the different continents; and I have some countries, some regions, some states, and things like that. Now, if you look down here, you can see some of the instances that I have defined, some cities: London, Hyderabad, Paris, Oslo, and so on.
B: So if you go into the definition of this, what I'll do is turn off the inferences, and we'll see here that London is defined as a city and has a property "capital of" the United Kingdom. Hyderabad is defined as a city and has a property "capital of" Telangana; Paris, similarly, is "capital of" France.
B: Similarly, you can see for other things: you can see that Hyderabad has been classified as a capital city, even though we didn't say it is a capital city, or of type capital city, and also that it is a part of Asia. These are some of the inferences that have been drawn. If I go to country, you can see here:
B: India has part Hyderabad, has part Andhra Pradesh, and such things. That India is a part of Asia is what I declared, but here are all the other things that it has inferred, everything in yellow.
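The two inferences shown in this demo can be mimicked in a few lines: classifying any city that is "capital of" something as a capital city, and computing the transitive part-of closure. This is only a sketch of what the OWL reasoner does; the geography facts mirror the demo's examples, and the chain Hyderabad-Telangana-India is an assumed reading of the slide.

```python
# Asserted facts, mirroring the demo. The reasoner's job is to derive
# what is NOT asserted: Hyderabad is a capital city and part of Asia.
PART_OF = {"Hyderabad": "Telangana", "Telangana": "India", "India": "Asia"}
CITY = {"Hyderabad", "London", "Paris", "Oslo"}
CAPITAL_OF = {"Hyderabad": "Telangana", "London": "United Kingdom",
              "Paris": "France"}

def part_of_closure(x):
    """Everything x is transitively a part of."""
    out = []
    while x in PART_OF:
        x = PART_OF[x]
        out.append(x)
    return out

# 'Capital city' as a defined class: any city that is capital of something.
capital_cities = sorted(c for c in CITY if c in CAPITAL_OF)
print(capital_cities)                 # ['Hyderabad', 'London', 'Paris']
print(part_of_closure("Hyderabad"))   # ['Telangana', 'India', 'Asia']
```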
B: So I hope this gives you a little bit of an idea of the kinds of things you can do with an ontology. Going back to the presentation:
B: I want to quickly go through some ontologies in healthcare. If you look at a proper definition for a disease term, something like acromegaly, this is what it would look like. We represent its clinical findings: acromegaly is a disease which presents with large jaw, large hands and feet, joint pains, excessive height, and all that.
B: It is a type of gigantism, which is a type of hyperpituitarism that has location pituitary; a part of the pituitary is the adenohypophysis, and acromegaly has location adenohypophysis. So you can see this is a very good graphical representation of a model of what the disease looks like. And we do this by using a number of reference ontologies, like SNOMED, which is a very large model covering all of medicine, with about 300,000 terms; the Foundational Model of Anatomy, with about 70,000 terms; the Gene Ontology, with about 30,000 terms; and then the Ontology for General Medical Science, the Infectious Disease Ontology, and so on. Now, reference ontologies represent domain models.
B: The benefits are that they are carefully curated, consensus-based, and they often have formal definitions, both English and logical. So here is a definition in English of what a disease means: a disease is a disposition to undergo pathological processes that exists in an organism because of one or more disorders in that organism.
B: This is where application ontologies come in, where we're dealing with specific needs. For example, one of the projects that I did, SemanticDB, was in the cardiovascular space, dealing with ECG, cath, echo, and surgical procedures. PhysioMIMI was in the sleep domain, where we were dealing with polysomnography data. And then we have other needs, like user interface needs for data entry, measurement units, and even semantic search applications.
B: So just to summarize this: to build an application ontology, we take the use cases, which explain what the domain is; that gives us a top-down view.
B: We look at the content that is available, the data that is available, and that provides us the bottom-up view; and we look at other needs, like, for example, the user interface needs. And we use some modeling principles, like reuse, modularization, and frameworks, to do this. So I'll skip this one; the previous one was looking at the clinical use case I spoke about, from a top-down approach.
B: We can look at what kinds of questions we want answered; from the bottom up, some term lists, like gender, age, and so on; and between the two of them, that gives us a prioritization of how we should approach the modeling. Technically, again, there are the user interface requirements, like visual query and search and things like that; complex definitions; as well as calculations and derived values.
B: So the first approach is reuse. The goal is that we need to reuse existing models in a custom situation to solve the problem. But the reference models are pretty large, so what we do is use segmentation algorithms to create custom modules of them, and this results in custom models that are useful for our individual situations.
B: The thing is, the more and more custom models we use, it can again lead to fragmentation, and this is where frameworks come into the picture, with the goal of avoiding fragmentation. For this we use frameworks like Basic Formal Ontology, which is one of them; there are other ones, like BioTop; and then there are some principles laid out by the OBO Foundry, like reuse, modularization, and frameworks, which we use.
B: So here is an example of the sleep domain ontology that we built using Basic Formal Ontology, BioTop, the Clinical Patient Record Ontology, the Ontology for General Medical Science, NEMO for neurology, an anatomy ontology, and then some special ones for units and drugs. Now, here is the first use case we'll look at. This is from the PhysioMIMI project, and this is what we did.
B: We were accessing data from about six-plus data stores in four different institutions, across three different states, all of them built individually and separately. So none of the data schemas were matching, and we needed to build a platform whereby we could query the data across these four institutions.
B: So let's see how we did this. The first thing in the data we looked at were the units. We have units for things like age, represented in years; height in centimeters; weight in pounds; and hypopnea, which has, you know, the 30%, seconds, time duration, percentage, and things like that. So we have units both in SI as well as English.
B: So what we did was take a look at what kinds of ontologies were available for units. We had a couple of different models we could use, PATO and the Measurement Units Ontology, but our analysis showed that they were not adequate from a coverage point of view, and also that they were not expressive enough. So we were forced to build a custom ontology for units.
B: The ontology language does not support this, so we overcame it by using formula annotations, where we represented the formula as part of the definition, in an annotation. From this we did conversions like foot to inch (units of the same type), kilogram to gram, as well as conversions between English units and SI units, and based on these formulae we did an inferential expansion, so that we could get a conversion between any unit and any related unit.
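The "inferential expansion" idea, deriving a conversion between any two related units from a few stored formulae, can be sketched as a graph search over direct conversion factors. This is only an illustration of the concept (the project stored the formulae as OWL annotations, not Python), with a deliberately tiny factor table.

```python
# A few direct conversion factors; the expansion step derives the rest
# by chaining them, so any pair of connected units becomes convertible.
DIRECT = {  # (from, to): multiplicative factor
    ("foot", "inch"): 12.0,
    ("inch", "centimeter"): 2.54,
    ("kilogram", "gram"): 1000.0,
}

def factor(src, dst, seen=frozenset()):
    """Find a conversion factor by chaining direct factors (either way)."""
    if src == dst:
        return 1.0
    for (a, b), f in DIRECT.items():
        for u, v, g in ((a, b, f), (b, a, 1.0 / f)):
            if u == src and v not in seen:
                rest = factor(v, dst, seen | {src})
                if rest is not None:
                    return g * rest
    return None

print(round(factor("foot", "centimeter"), 2))  # 30.48, via foot->inch->cm
print(factor("foot", "gram"))                  # None: lengths vs masses
```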
B: The second challenge that we had to solve was complex definitions.
B: So here is something I showed previously: the sleep hypopnea finding. The American Academy of Sleep Medicine has this definition: airflow reduction greater than or equal to 30% of baseline, for at least 10 seconds, and hemoglobin desaturation greater than 4% from baseline; and you can see here the operational definition in OWL.
B: Now, like this, we had about ten different definitions that we found: sleep hypopnea finding for adults, hypopnea finding for children, AASM 1, AASM 2, CHAT, and all that. And as you can see here, we defined each one of them, basically, as a single simple list under the named hypopnea finding, and look what happens when you classify it.
B: It has these subclasses, the CHAT and AASM definitions. So the third thing that we did was use the ontology for user interfaces, where we looked at the different kinds of data: continuous data, categorical data, and boolean. Continuous data like age is provided with a formal definition, using the measurement unit year; it's a float with different ranges. And then for categorical:
B: Similarly, we define, in this case, race: we defined all these different races as categorical values, and when we generate the user interface from this, based on these kinds of patterns, we can draw these kinds of widgets. So if you pick age, in this case age at the time of study, you can see how a widget is drawn where you can set the age limits; and then for categorical kinds of values, you can see that you can represent them as checkboxes. Now, going back to this picture: in the user interface, you can see that we have a number of terms.
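The datatype-pattern-to-widget mapping just described can be sketched as a small dispatch: continuous variables get a range control, categoricals get checkboxes, booleans a toggle. The variable definitions below are illustrative stand-ins for the ontology's formal definitions, not the project's actual data.

```python
# Hypothetical variable definitions, standing in for the formal
# ontology patterns (measurement unit + range, enumerated values, bool).
VARIABLES = {
    "age at time of study": {"kind": "continuous", "unit": "year",
                             "min": 0, "max": 120},
    "race": {"kind": "categorical",
             "values": ["Asian", "Black", "White", "Other"]},
    "history of stroke": {"kind": "boolean"},
}

def widget_for(name):
    """Pick a UI widget from the variable's datatype pattern."""
    spec = VARIABLES[name]
    if spec["kind"] == "continuous":
        return ("range", spec["min"], spec["max"], spec["unit"])
    if spec["kind"] == "categorical":
        return ("checkboxes", spec["values"])
    return ("toggle", [True, False])

print(widget_for("age at time of study"))  # ('range', 0, 120, 'year')
print(widget_for("race"))
```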
B: So we have a search interface where, as you type in the letters, the terms start filtering, and these terms came directly from the sleep domain ontology.
B: We placed an adapter that provided the mapping layer between the ontology and each one of these data stores, and then, using the ontology, we generated this user interface. For the queries that were described using this user interface, we created an abstract query based on what the user wanted; that was sent over to each one of these data stores and, using the mapping layer, translated into the individual SQL dialect. The query ran, and then the results that were brought back were again translated back into the ontology terms and displayed in the user interface. So this was one use case. The second use case we did was ClinicalKey, a semantic search interface, where we were dealing with about 500-plus journals and 700-plus textbooks, which we put through an NLP pipeline for information extraction, and we used EMMeT, Elsevier's medical taxonomy, as the core of the ontology model for this NLP pipeline.
B: So this was a project that I did when I was at Elsevier. The results of this NLP pipeline were in two forms: one was a linked data repository, for which we had a SPARQL interface and could do interactive queries, and the other was a database with Apache Solr, with which we could do full-text search.
B: So this was a product that was developed and deployed into the market, the ClinicalKey search interface. Okay, so with that, I want to conclude by just going over the role of an ontology in describing reality. First, it is useful as a formal representation, and it is computable.
B: It forms the basis for an information model; it aids query formulation using SPARQL; we saw how we can integrate multiple distributed data stores with it; and we can bring together different standard terminologies, which provides data integration as well.
B: So I want to thank and acknowledge some of my colleagues with whom I worked on these projects: the SemanticDB project at Cleveland Clinic, the PhysioMIMI project at Case Western Reserve University, and the smart content project at Elsevier.
A: Awesome, thank you very much. I think we're all giving a virtual round of applause; a very interesting presentation. I am certainly inspired by these ideas, because this is actually useful for a multitude of situations, not just healthcare.
B: So it's a mix of both. To create good ontologies takes a lot of work, and you need to have good domain knowledge to represent the definitions well, and it's a very consensus-based approach also. Now, does that mean that we don't use any other tools and techniques to help us in this process?
B: No, we do use some NLP tools. If you have a body of text that you want to analyze, to see what kinds of terms we want to represent in the ontology, you can use some NLP, and you can do things like word frequencies, simple things that tell us what the most common things in there are. That helps us, from the bottom-up approach that I talked about, in understanding what the important terms are,
B: the areas that we want to represent first. Now, are those terms that we want to represent available in any existing ontology? Because if something is existing in an ontology, we want to be able to reuse it. A number of these ontologies are open source; they're openly available. SNOMED, RxNorm, FMA, and LOINC are some of the models that we use very often in medicine.
B: And these are fairly flexible with their licenses, so we can use them, and we don't want to reinvent the wheel. So the first thing is to search: look to see if something covers it, and if it is available, use it. That is one of the basic principles of the semantic web and of using ontologies.
B: If it is not available, then we will need to extend a framework or an existing ontology, because we are not creating this in isolation. So, for example, COVID recently was a good example: until recently we did not have a term for the disease coming out of COVID. So what we did is we extended the Infectious Disease Ontology: we subclassed the SARS virus to represent the coronavirus, the SARS-CoV-2 virus, and similarly the disease that is caused by the coronavirus.
C: Very interesting. Dr. Arabandi, Tom Hicks here. Thank you very much, a nice presentation introducing ontologies. But I think of an ontology as a tool that serves the purpose of achieving some higher goal, and I'm curious to know what your uses were for the ontology once you created it. For instance, in the first example, you gave the clinical notes.
B: And so, once we were able to interpret that clinical text, it was represented in the form of an RDF graph, and then the same ontology terms were used as part of the query interface to query the RDF graph.
B
To
extract,
it
is
a
little
bit
of
a
chickening
the
fortunately,
we
a
lot
of
the
reference
ontologies
provides
us
with
very,
very
good
domain
coverage,
and
so
a
lot
of
terms
are
already
there
existing
and
we
can
leverage
those.
B
But,
as
we
are
parsing
the
text,
we
are
going
to
find
terms
that
are
not
there
and
those.
We
definitely
need
to
add.
So
it's
not
a
it's,
not
a
one-off
process.
Where
you
you
look
at
some
text,
you
you
develop
the
ontology
and
you're
done
with
it.
So,
as
new
text
comes
into
play,
your
or
our
our
as
our
reality
changes
right,
you're
going
to
keep
evolving
the
ontology.
A
Thank you very much, guys. I know there are more questions, and in the interest of time, let's push this to after the final presentations when we have the open discussion; I'll bring back the questions that were not answered for now. Thank you very much, Sivaram. We will now move on with the presentation by Pier Gherardini on CANDEL, which is a very interesting platform — very cool infrastructure mixing Clojure, R, and Datomic. Pier, the stage is yours.
F
Yeah, so as Santiago mentioned, this is really a platform for doing biological data science that was built using a combination of Clojure, R, and Datomic. Before I go into the details of the platform — I work at the Parker Institute for Cancer Immunotherapy, so I'm going to give just a two-second introduction to the institute and the main work that we're carrying out there, because we really built the platform to power our work, and so it's important to understand what our work is.
F
First,
so
the
vision
of
the
institute
is
really
to
transform
the
way
that
medical
research
is
done
and
to
in
order
to
turn
our
cancer
into
curable
diseases,
and
the
institute
itself
is
particularly
focused
on
immunotherapy
as
a
treatment
modality,
which
is
the
idea
of
using
the
body's
own
immune
system
to
attack
cancer
and
then
and
eliminate
cancer
and
the
the
pisces
model.
So
I'm
gonna
call
it
pisces,
which
is
the
algorithm
that
we
use
internally
and
actually
in
our
law
as
well.
F
So
when
I
say
pisces,
I'm
in
the
parking
issue
for
cancer
immunotherapy,
so
the
the
model,
the
pisces
model,
is
really
built
around
collaboration
and
and
the
the
institute
was
born
as
a
as
an
alliance
between
some
of
the
major
cancer
centers
in
the
nation
that
you
can
see
in
this
slide
and-
and
you
know
some
of
the
major
really
leading
researchers
in
these
fields-
and
there
is
a
central
office
in
san
francisco,
which
is
where
I
work
and
where,
where
we
have
an
informatics
team,
that,
as
part
of
this
charge,
is
taking
data
from
all
these
different
sites
and
and
both
from
clinical
studies,
as
well
as
laboratory
studies
and
laboratory
research
up
and
all
these
sites
and
trying
to
maximize
the
amount
of
information
that
we
can
extract
from
this
data
set
in
order
to
advance
the
field
of
cancer
immunotherapy
in
general.
F
So
the
as
I
say,
data
is
really
you
know
the
fuel
that
powers,
everything
that
we
do
and
and
in
our
research.
What
we're
really
trying
to
do
is
go
to
go
from
a
bench
discovery
so
something
that
one
of
the
investigators
discover
into
in.
F
In
one
of
the
labs,
maybe
in
an
animal
model,
move
that
to
the
bedside
so
translate
this
into
a
clinical
trial
that
can
be
used
to
test
this
therapy
in
in
actual
human
patients
and
then
eventually,
if
the
clinical
trials,
successful,
move
this
therapies
to
the
market,
which
is
the
way
that
you
know
commercialization
is
really
the
way
that
medicine
go
out
in
the
general
population.
Besides
that,
besides
a
clinical
trial,
that's
that's
a
very
limited,
obviously
population,
just
for
testing,
and
you
know
as
part
of
this.
F
Sometimes
we
also
want
to
what
we
want
to
go
the
opposite
so
we're
going
to.
We
want
to
go
from
bedside
to
bench,
so
we
make
a.
We
make
an
observation
in
a
patient.
We
discover,
for
instance,
that
you
know
patients
that
don't
respond
to
a
certain
therapy,
have
a
very
high
level
of
expression
of
a
given
gene,
and
so
we
want
to
go
back
to
the
bench
and
trying
to
figure
out
the
experiments
that
can
explain
his
observation
that
we've
made
we're
making
patients,
and
so
in
all
of
these
we
use
data
extensively.
F
You
know
both
molecular
data
from
from
our
patient
sample,
as
well
as
clinical
data
and
really
candle
is
the
engine
that
uses
all
this
data
to
power
all
this
work,
and
so
it's
really
the
core
data
infrastructure
that
is
central
to
the
work
that
we
do
at
the
institute.
F
Here is how it actually looks in practice, in our specific example. We are running a number of studies as an institute — a number of clinical trials — and in doing these clinical trials we collect both tumor tissue from patients as well as blood. So we get biopsies and we get blood draws from these patients, and then we use a large suite of different molecular assays on these samples to get molecular measurements.
F
So,
for
instance,
on
the
tumor,
we
can
do
whole
exome,
sequencing
and
figure
out
all
the
mutations
that
are
in
the
tumor
or
we
can
do
multi-parameter
imaging
and
trying
to
you
know,
see
the
immune
cells
and
the
tumor
cells
and
the
relationship
between
the
immune
cells
and
the
and
the
tumor
cell
in
in
the
impact
tissue,
or
we
can
do
rna
sequencing,
which
is
essentially
measuring
the
expression
of
different
genes
in
the
tumor
and
similarly
on
the
blood.
We
can
measure
different
type
of
of
molecules
that
are
that
are
in
the
serum.
F
Different
types
of
proteins
are
in
the
serum
that
are
important
for,
for,
or
you
know,
for
the
working
of
the
immune
system
as
well
as
we
can
profile
the
all
the
immune
cells
that
are
in
the
blood-
and
you
know
knowing
these
patients,
how
many
cells
of
one
type
there
are
how
many
cells
of
the
other
type
there
are,
and
also
what's
the
activity
of
all
these
different
cells
in
the
in
in
the
blood
and
and
then
so.
F
All
of
this
generates
a
large
stream
of
molecular
data,
so
molecular
information
about
what's
going
on
in
the
cells
and
and
the
genes
of
of
this
patient,
and
then
we
really
marry
that
with
the
clinical
information
that
we
get,
we
get
from
our
studies,
which
are
more
kind
of
more
like
the
thing
that
might
the
previous
speaker
was
talking
about,
which
are
information
such
as
you
know
what
is
the
type
of
cancer
that
this
patient
has
you
know
what
other
comorbidities
this
patient
is
experiencing,
what
drug
treatments
he
received?
What
was
the
response?
F
How
long
did
he
leave?
How
long
did
it
pass
before
the
disease
recurred
in
the
case
of
cancer,
etc?
So
all
the
all,
this
type
of
clinical
information
once
again
very
similar
to
what
the
previous
speaker
was
talking
about,
and
so
our
job
and
the
job
of
this
you
know
black
actually
great
box
here-
is
to
bring
these
two
together
and
trying
to
figure
out
if
there
is
any
in
the
molecular
data
that
predicts
any
of
this
clinical
feature
right.
F
So,
first
of
all,
we
want
to
be
able
to
give
treatment
to
patients
that
work,
and
so
when
the
patient
comes
at
the
door,
if,
if
we
see,
if
we
do
a
molecular
test,
then
we
see
that
a
certain
treatment
is
not
going
to
work
on
this
patient.
We
don't
want
to
give
it
to
him
and
vice
versa.
If
we
have
a
library
of
treatment
to
choose,
for,
we
want
to
make
sure
that
we
give
them
the
treatment,
that's
appropriate
for
the
molecular
profile.
F
But
that's
one
reason.
The
other
reason
is
to
further
advance
the
field
of
clinical
research
right,
because
if
we
discover
that
the
patients
that
don't
respond
to
this
therapy,
they
all
have
a
very
high
expression
of
gene
x.
Then
maybe
the
next
step
could
be
trying
to
figure
out
if
you
can
develop
a
molecule
that
targets,
gene
x
and
so
can
be
combined
with
that
with
existing
treatment.
F
To
also
address
the
needs
of
this
specification
for
patient
population
where,
where
the
previous
treatment
wasn't
working
right,
so
the
way
that
this
we
proceed
is
by
making
all
of
these
observations
impatient
and
then
using
this
observation
to
sort
of
further
advance
the
transcendental
practice,
because
in
every
clinical
study
there
is
always
some
subset
of
patients
for
which
the
treatment
works
and
some
subset
of
patients
from
which
for
which
the
treatment
doesn't
work.
And
so
the
job
is
really
teasing.
The
two
apart
using
the
the
molecularity
that
we
collect.
F
So
typically,
we
have
a
kind
of
like
a
small
number
of
subjects
in
the
study,
so
maybe,
let's
say
50
to
100
subjects
and
for
for
the
for
each
one
of
this
subject,
we
have
several,
you
can
think
of
it
as
a
table,
so
spreadsheet
of
molecular
measurement.
That
represents
the
results
of
one
of
this
acid
that
we
carried
out
on
this
subject.
So,
for
instance,
the
gene
expression
assay
right.
F
So
in
the
case
of
gene
expression
assay,
we
get
a
big
table
that
contains
expression
of
you
know:
20
000
different
genes
in
the
in
the
samples
from
this
specification,
and
we
have
the
similar
thing
for
all
the
assets
that
we
run.
So
we
will
have
a
big
table
for
our
gene
expression
assay.
We
will
have
a
big
table
for
our
genome
sequencing
assay.
F
Well,
the
other.
The
other
thing
about
the
other
feature
of
this
is
it's
very
deeply
interrelated,
because
at
the
end
of
the
day,
when
we
do
these
molecular
measurements,
we're
measuring
a
lot
of
things
that
are
related
to
each
other.
So
when
we
measure
the
abundance
of
the
proteins
in
the
blood,
these
proteins
regulate
the
activities
of
the
cells
that
we're
also
measuring
in
the
blood
right
and
when
we
measure
the
composition
of
of
the
the
number
of
immune
cells
into
a
tumor.
F
Well,
this
immune
cell
come
from
the
blood
in
the
first
place,
so
they
migrate
from
the
blood
to
the
tumor,
and
so
all
of
these
different
measures
that
we
are
that
we're
measuring
are
really
kind
of
like
looking
at
at
the
same
biological
system,
the
same
new
methodological
system
just
from
a
lot
of
different
angles,
and
so
this
data
is
very
deeply
interconnected
and
related,
and
the
one
of
the
the
other
important
feature
of
this
data
is
it's.
It's
typically
sparse.
F
So
because
these
are,
you
know,
very
sick
patients
that
we're
getting
these
samples
from.
Sometimes
it
won't
be
possible
to
have
a
tissue
biopsies,
and
sometimes
you
know,
somebody
was
scheduled
to
to
get
a
blunt
wrong
blood,
throne
or
given
day,
but
he
had
to
skip
the
blood
draw
and
so
we're
not
gonna
have
that
sample
from
the
for
that
patient.
So
the
data
is
a
little
bit
of
a
swiss
cheese
of
what
is
what
we
can
get
from
from
from
from
these
very
sick
patients,
and
and
very
often
it's
not
complete.
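The "Swiss cheese" shape can be pictured as per-assay tables that each cover only a subset of subjects, so combining them yields explicit gaps rather than a dense matrix. A hedged sketch in Python — the assay names, subject IDs, and values are all made up:

```python
# Each assay covers only the subjects for whom a sample was available.
rnaseq = {"subj-1": 12.3, "subj-2": 8.1}    # no subj-3 biopsy was possible
cytof  = {"subj-1": 0.42, "subj-3": 0.55}   # subj-2 skipped a blood draw

subjects = sorted(set(rnaseq) | set(cytof))

# Joining across assays yields a sparse table with explicit gaps (None).
table = {s: {"rnaseq": rnaseq.get(s), "cytof": cytof.get(s)}
         for s in subjects}
```

Any downstream analysis then has to tolerate these `None` holes, which is why the speaker stresses that the data is never complete.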
F
Actually,
it's
never
complete.
Basically,
so
the
thing
that
we
like
to
say
is
this
is
this
is
not
really
big
data,
but
it's
deep
data.
So
it's
not
big
data,
because
it's
not
a
ton
of
data.
I
mean
this
is
not
like
facebook
level.
Big
data
on
you
know
billions
of
users
and
hundreds
of
thousands
of
clicks
every
day,
but
it's
sorry
some
it's
a
much
smaller
data
set,
but
it's
very
deep
because
we're
collecting
a
lot
of
lot
of
different
features
on
on
these
data
sets
on
on
this
patient.
F
So
what
we
typically
want
to
do
with
this
data?
This
is
something
that
I've
already
you
know
mentioned
mentioned
before.
So
I'm
going
to
go
quickly
to
this
slide
but,
as
I
said,
you
know
identify
specific
subset
of
patients
to
you
know,
figure
out
if
they're
going
to
benefit
from
the
therapy
or
or
or
if
I'm
not
going
to
benefit
from
from
the
therapy.
F
For
instance,
then
you
know
another
another
thing
we
want
to
do
is
we
determine
if
a
certain
observation
has
been
made
before
right,
because
we
are
running
lots
of
studies
and
other
people
in
the
field
are
running
a
lot
lots
of
these
studies,
and
so
when
we
see
that
you
know,
gene
x
is
over
expressed
in
the
people
that
do
not
respond
to
the
therapy.
We
want
to
be
able
to
see.
F
Has
this
thing
ever
been
seen
in
another
data
set,
so
there
is
a
component
of
sort
of
meta-analysis
and
going
back
to
the
existing
data
and
querying
for
specific
observations,
and
then
you
know.
Last
but
not
least,
we
want
to
use
use
all
of
these
to
build,
as
I
said,
predictive
models
that
give
us
insight
into
the
mechanism
of
action
of
therapy
right.
We
want
to
see
if
gene
x
is
important
for
the
patients
that
do
not
respond
to
this
therapy.
F
What
is
the
mechanic,
the
biological
mechanism
that
under
underlies
this,
this
importance
and
the
reason
why
genexism
is
expressed
so
in
order
to
so
we
have
a
team
at
the
the
parking
institute
that
basically
is
doing
all
this
kind
of
work
for
the
studies
that
we're
running
as
an
institute.
So
we
as
an
institute
sponsor
a
number
of
clinical
trials.
We
collect
all
this
data
set,
we
bring
it
in
house
and
then
there
is
a
team,
the
informatics
team
of
which
I'm
a
member.
F
Obviously,
that
is
tasked
with
doing
all
the
work
that
that
has
developed
now,
so
in
order
to
facilitate
and
really
power
our
work,
we
created
this
platform
that
we
call
candle
that
stands
for
cancer
data
and
evidence
library,
and
it's
really
a
platform
for
biological
data
science
in
general.
That
supports
a
lot
of
different
data
types,
and
I'm
gonna
talk
more
about
that
in
the
in
the
next
of
the
presentation
of
the
subsequent
slides.
So
the
platform
is
really
you
know
we
can.
F
We
can
conceptualize
it
as
it
is
in
three
steps,
so
we
start
with
raw
data.
So
we
get.
You
know
big
massive
data
files
from
our
sequencing
vendor
for
our
imaging
vendors
etc.
So
this
is
a
you
know,
raw
binary
files
that
we
need
to
associate
with
whatever
whatever
sample
they
came
from,
and
at
this
point
this
association
is
very
much
unstructured,
so
you
can
think
of
this
as
a
almost
like
an
object
database.
We
have
all
these
objects.
We
associate
the
metadata,
that's
completely
instruction,
it's
just
what
we
get
from
from
a
vendor.
F
They
can
they
get
summarized
down
to
a
very
small
spreadsheet,
of
maybe
a
few
hundred
k
that
contains
all
the
variants
that
are
in
in
a
specific
sample
right.
So
there
is
a.
There
is
a
a
step
here
where
the
data
is
taken
from
raw
into
a
set
of
features
that
are
much
more
distilled
down
in
size
and
once
again,
this
feature
will
be
things.
As
you
know,
what
is
the
abundance
of
a
specific
protein
in
the
blood
or
what
is
the
proportion
of
cells
of
a
given
type
in
the
blood?
F
Or
what
mutation
does
this
patient
have
etcetera,
etcetera
and
so
from
then
all
the
all
this
data,
all
these
all
this
feature
then
go
into
the
scandal
database,
which
is
a
highly
structured
database,
and
it's
really,
you
know
the
basis
that
we
use
for
doing
all
the
all
the
subsequent
data
science
work.
F
So
trying
to
answer
all
the
questions
I
was
talking
about
before
we,
we
have
all
the
data
in
the
candle
database
and
then
we
pull
the
data
out
of
the
database,
and
we
do
you,
know,
machine
learning
and
exploratory
statistics
on
this
data,
so
the
rest
of
this
presentation
and
even
though
we
are
using
closure
and
r
and
atomic
across
this
entire
infrastructure,
the
rest
of
this
presentation
is
really
going
to
be
mostly
focused
on
on
on
on
this
last
piece
here:
the
the
candle
database
and
how
we
built
it
and
how
we
use
it,
how
it's
organized
so,
as
I
said,
this
database
is
really
a
platform
for
biological
data
science
and
and
the
core
idea
is
to
really
break
down
the
silos
between
different
types
of
molecular
and
clinical
data.
F
So,
as
I
said
up
to
now,
we
get
a
very
broad
variety
of
data
sets
and-
and
we
want
them
to
have
them
all
in
a
single
place,
so
that
we
can
do
queries
and
interrogations
that
really
navigate
this
data
very
freely
and
and-
and
you
know,
without
impediment
in
silos,
so
breaking
down
periods.
Putting
everything
together
was
was
a
major
design
goal
of
this
project.
F
Another
very
important
thing
that
has
to
do
with
with
the
efficiency
with
which
our
team
can
work
and
work
on
the
data
was
really
enabling
the
whole
team
to
make
sure
that
they're
working
on
the
same
data.
So
this
is
like
a
data
version
problem
right.
We
have
five
different
data
scientists
that
are
working
on
a
specific
trial.
F
We
don't
want
all
of
them
to
have
copies
of
spreadsheets
on
their
computer
and
then
you're
never
sure
that
they're
really
working
on
the
actual
same
version
of
the
data
we
want
to
have
a
centralized
repository
where,
ultimately,
that's
probably
version,
and
so
everybody
accessory
can
be
100,
confident,
they're.
Looking
at
the
same
data,
the
other
thing
that
was
really
important
about
you
know.
That's
still
related
to
the
efficiency
of
the
team.
Is
this
idea
of
making
analysis
code
reusable
across
projects
right?
F
So
what
happens
typically
in
when
this
work
is
done
in
an
academic
environment
where
this
sort
of
infrastructure
does
not
exist?
Is
that
somebody
will
have
a
spreadsheet
on
it
on
his
laptop
for
a
given
project,
he
will
write
a
whole
script
that
does
a
whole
complicated
analysis
and
then
the
next
project
comes
along
and
that
now
the
files
are
made
completely
different.
F
Now
the
code
that
we
write
is
really
is
really
reusable,
because
if
I
you
know,
if
I
write
a
script
to
do
a
certain
horizon
or
given
a
given
study,
when
I
move
to
a
different
study,
I
know
that
the
data
is
always
in
the
same
shape,
because
the
shape
of
the
leader
is
dictated
by
the
data
model
of
the
of
the
of
the
database,
and
so
my
code
becomes
much
more
usable
across
projects
and
across
the
members,
and
you
know
last
but
not
least,
was
this
idea
which
really
leverages
atomic
for
those
of
you
that
are
familiar
with
it,
which
is
the
really
keep
a
history
of
the
data
to
make
sure
that
we
are
always
able
to
reproduce
our
results
right.
F
So
we
want
to
be
able
if
two
years
from
now
somebody
comes
along
and
says.
Oh,
I
want
to
reproduce
the
same
plot
that
you
produced
two
years
ago.
We
want
to
be
able
to
be
able
to
do
what
I
just
described.
That
really
requires
a
system
for
data
versioning
that
that
is,
that
is
very
granular,
because
otherwise
there
will
be
no
no
way
for
me
to
go
back
to
the
state
of
the
deal
two
years
ago
and
be
able
to
do
this
if
I
didn't
have
a
specific
system
for
it.
F
So
in
answering
all
of
this,
all
of
this-
let's
say
use
case
and
and
needs
our
approach
was
really
to
leverage
closure
and
atomic
unique.
You
know:
data
modeling
and
processing,
primitives
upstream
of
thing
that
standard
data
science
tools.
So
what
we're
doing
here
is
not
replacing
the
whole
data
science
stuck
with
closure
and
atomic.
F
It's
a
it's
leveraging,
closure
and
atomic
for
the
for
an
area
that
we
think
it's
very
well
suited
for
that
has
to
do
with
the
data
regularization
data
processing,
while
at
the
same
time,
building
a
bridge
for
for
the
standard
data
science
workflow,
which
in
our
field,
which
is
that
of
computational
biology,
means
working
in
r
okay.
F
So
we're
not
trying
to
use
quotient
atomic
for
everything
where
we're
we're
using
closure,
entertainment
for
what
we
think
is
really
good
at
and
then
we're
building
a
bridge
to
r
for
for,
for
all
these,
the
scenarios
where
r
is
actually
a
better
suited
environment
so
why
we
choose
the
atomic
specifically
is
for
a
number
of
reasons.
One,
and
I
think
one
of
the
most
important
one
is
that
the
schema
is
malleable
to
change.
So
the
the
problem
of
biology
is
that
it's
an
exceedingly
complicated
field
and
also
it's
continuously
evolving.
F
So
if
we,
if
you,
if
you
use
a
database
technology
which
is
very,
you
know,
inflexible
where
you
have
to
get
basically
the
schema
right
at
the
get-go
and
then
you're
stuck
with
it,
because
it's
very
hard
to
change,
you
are
really
in
for
some
trouble,
because
you're
never
ever
going
to
be
able
to
anticipate
what.
If
the
data
is
going
to
look
like
two
years
from
now,
let
alone
20
years
from
now
so
having
a
schema
that
was
malleable
to
change
was
extremely
important
and
one
of
the
main
reasons
why
we
choose
the
atomic.
F
Then there is this concept of the history of the data, and being able to access the whole history of the data, which is very important for the reproducibility goal that I mentioned in the previous slide. Another important one is performance, obviously: the performance that we get from the system is absolutely great for the kind of work that we're doing. And the last two are more technical aspects. One has to do with the expressiveness of the query language.
F
So
the
query
language
of
the
atomic
is
is
actually
it's
very,
very
simple,
but
but
very
expressive.
I
don't
have
time
to
go
into
it,
but
I,
if
you're
not
familiar
with
it,
I
will
I
will.
You
know,
suggest
all
you
look
into
it,
because
it's
it's
very
simple
and
very
elegant.
At
the
same
time
very
powerful-
and
it's
called
data
log-
it's
really
a
very
relative
of
prolog-
and
you
know.
F
Last
but
not
least,
the
economics
of
the
api
I
mean
working
with
the
atomic
api
is
really
is
really
nice,
and
you
know
that
was
a
huge
help
for
the
for
dev
velocity,
so
the
basics
of
the
atomic.
I
have
a
slide
here
just
to
give
you
at
the
basics
of
the
the
atomic
information
model,
for
those
of
you
that
are
not
familiar,
and
actually
the
previous
talk
was
the
perfect.
F
You
know
background
tool
of
this,
because
the
atomic
is
really
based
on
a
lot
of
the
same
concept
that
the
previous
figure
talked
about.
So
everything
in
atomic
is
is
modeled
as
a
as
datums,
which
are
really
doubles.
F
So
this
is
this
table
here,
represents
you
know,
a
collection
of
doubles
or
thetums
that
show
you
a
little
bit
of
the
structure.
So
we
have
an
entity
id
that
is
used
to
identify
entity
and
especially
to
identify
tuples
that
refer
to
the
same
entity.
F
Then
we
have
an
attribute
that
represents
what
we
are
saying
about
this
specific
entity
and
then
a
value
for
this
amp.
So,
for
instance,
the
first
tuple
tells
me
that
the
the
entity,
one
two
three-
is
a
subject
with
id
one,
two
three
five
x
and
then
the
same
entity
is
a
subject
that
has
disease
head
and
neck
cancer,
and
then
the
same
entity
has
been
subjected
to
a
therapy
called
789.
F
So
now
789
would
be
the
id
of
another
another
entry
in
the
system,
and
so
the
entity,
eight
nine
and
this
other
entity,
which
is
a
a
therapy
with
the
name
kituda,
okay.
So
the
way
the
way
that
is-
and
this
is
the
way
that
the
model
relationship
between
entities
and
atomic
by
having
by
having
attributes
that
represent
the
value
whose
value
represents
the
idea
of
another
entity.
And
so
you
can
create
a
you
know,
a
graph
essentially
of
all
the
different
entities.
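The entity/attribute/value structure can be mimicked with plain tuples; it is the attribute values that are themselves entity IDs which turn a flat set of datoms into a graph. A sketch of the idea in Python, using illustrative attribute names in the spirit of the slide (not Datomic's actual API):

```python
# Datom-like tuples: (entity-id, attribute, value).
datoms = [
    (123, "subject/id", "subj-123"),
    (123, "subject/disease", "head and neck cancer"),
    (123, "subject/therapy", 789),   # value is another entity's ID -> a graph edge
    (789, "therapy/name", "Keytruda"),
]

def entity(eid):
    """Collapse an entity's datoms into a single attribute map."""
    return {a: v for e, a, v in datoms if e == eid}

# Follow the reference from the subject entity to the therapy entity.
therapy_name = entity(entity(123)["subject/therapy"])["therapy/name"]
```

Chasing `subject/therapy` from entity 123 to entity 789 is the toy equivalent of navigating the entity graph the speaker describes.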
F
So
it's
important
to
note
here
in
you
know
the
the
attributes
come
from
a
schema,
so
the
attributes
cannot
be
anything.
The
attributes
have
to
be
defined
in
a
schema,
but
it's
very
easy.
If
you
want
to
add
another
attribute,
you
just
add
it
to
the
schema
and
you
can
use
it
from
now
on
without
having
to
modify
the
existing
data,
and
the
other
thing
that's
important
is
that
attribute
can
be
named
space
here,
as
you
can
see,
and
that
allows
you
to
sort
of
define
a
concept
a
little
bit
of
entity.
F
So
even
though
the
atomic
doesn't
know
doesn't
have
a
concept
of
entity
per
se,
so
there
isn't
such
thing
as
a
subject
entity
or
a
therapy
entity.
We
can
model
the
same
thing
using
attribute
namespaces,
essentially,
and
so
last
but
not
least,
as
I
was
saying,
the
the
time
is
actually
first
class
concept
in
the
tommy,
so
the
actual
table
in
the
database
looks
something
more
like
this,
which
is.
F
There
are
two
additional
fields
that
are
added,
which
represent
the
time
that
this
this
fact
was
inserted
in
the
database
and
also
whether
this
boolean,
that
whether
this
is
this
was
an
assertion
of
a
retraction
for
a
fact,
and
these
two,
these
two
two
different
things,
basically
allow
you
to
have
the
full
history
of
the
database.
Okay,
so
you
can
always
go
back
and
say
hey.
F
I
want
to
do
a
query
of
this
database
as
it
was
two
years
ago,
and
that
would
return
me
exactly
the
same
result
that
I
got
two
years
ago
and
that
because
of
the
because
of
the
way
that
time
is
treated
by
the
atomic
system,
so
we
we
get
all
of
this
history
for
free
by
using
the
timing.
Just
by
extending
these
two.
You
know
just
by
the
fact
that
the
atomic
extent
this
notion
of
rtf
apple
with
this
additional
time
and
then.
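That "history for free" behavior can be sketched by extending each fact with a transaction number and an added/retracted flag, and replaying the log up to a chosen point in time. This is just the idea, not Datomic's actual implementation; the facts are made up:

```python
# Five-tuples: (entity, attribute, value, tx, added?), ordered by tx.
log = [
    (123, "subject/disease", "melanoma", 1, True),
    (123, "subject/disease", "melanoma", 5, False),               # retraction
    (123, "subject/disease", "head and neck cancer", 5, True),    # new assertion
]

def as_of(tx):
    """Rebuild the set of current facts as they stood at transaction tx."""
    facts = set()
    for e, a, v, t, added in log:
        if t <= tx:
            (facts.add if added else facts.discard)((e, a, v))
    return facts

# The same query against the database "as of" an earlier tx gives the old answer.
old = as_of(1)
new = as_of(5)
```

Because nothing is ever overwritten in place, `as_of(1)` still returns the original fact years later, which is exactly the reproducibility property the speaker is after.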
F
Sorry,
okay,
so
most
of
the.
So
this
is
the
the
core
of
the
system.
So
most
of
the
of
the
of
our
work
as
a
as
as
developers
has
been,
you
know
how
to
like
how
to
facilitate
data
into
the
system
and
how
to
facilitate
using
data
and
so
getting
that
out
of
the
system
and
using
it.
So
in
the
in
the
next
few
slides
I'm
going
to
talk
a
little
bit
about
how
do
we?
The
system
will
be
to
for
getting
data
into
the
atomic
and
the
reason
why
we
built
a
specific
thing
here.
F
A
specific
system
or
specific
infrastructure
is
because
we
want
our
data
scientists
to
be
able
to
import
data
into
this
database
right.
So
our
data
scientists,
don't
know
closure,
don't
know
the
atomic,
don't
know
any
of
the
internals
of
the
database,
but
still
they
need
to
be
able
to
take
an
existing
dataset
and
put
into
the
system.
And
so,
in
order
to
to
facilitate
this,
we
developed
this
tool
that
called
prep
that
stands
for
programmable
etl
and
it's
really
an
a
configurable
etl
for
getting
data
into
the
atomic.
F
So what does the user need to do here? The only thing the user needs to do to import data into Datomic is write a configuration file — an EDN file — that specifies, essentially, how the columns and the files in the data set map to the attributes in the schema. So, for instance,
F
What
this
very
little
snippet
here
is
telling
me
is
that
the
you
know
the
barcode
column
matches
to
the
to
the
database
attribute
sample
id
and
that
the
participant
column
in
this
file
maps
maps
to
the
to
the
subject
attribute
in
the
in
in
the
schema
right.
So
this
is
basically
establishing
a
connection
between
the
column,
adders
and
the
attributes
in
the
schema.
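The effect of such a mapping file can be sketched as a dictionary from column headers to schema attributes, applied to each row of a flat file. The attribute names below are illustrative, not the real CANDEL schema:

```python
import csv
import io

# What the EDN config expresses: column header -> schema attribute.
mapping = {"barcode": "sample/id", "participant": "sample/subject"}

def rows_to_tx(csv_text):
    """Turn flat-file rows into attribute maps ready to transact."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{mapping[col]: val for col, val in row.items() if col in mapping}
            for row in reader]

tx_data = rows_to_tx("barcode,participant\nS-001,P-17\n")
```

The user only ever writes the `mapping` side of this; turning rows into transaction data is the tool's job.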
F
In
this
specific
example,
obviously
there
is
a
little
bit
more
than
that,
because
we
also
have
reference
between
the
different
files
etc.
But
this
this
should
be.
You
know
enough
to
provide
you
a
flavor
of
what
this
is
so
so
that
the
user
writes
a
configuration
file
doesn't
write
any
code.
F
They
just
write
this
configuration
file
and
then
the
tool
that
we
wrote-
prepped,
which
is
a
closure
command
line
tool,
takes
the
the
source
data,
the
configuration
file
and
the
knowledge
of
the
atomic
schema
and
metamodel,
which
is
some
additional
things
that
we
built
into
into
the
atomic
schema.
But
let's
say
the
atomic
schema
for
now,
so
it
takes
all
of
these
things
and
then
prepare
transaction
prepares
transaction
data
that
can
really
be
imported
into
the
input
database.
F
So
then,
in
a
subsequent
comment,
the
transact
command
print
will
take
all
of
this
transaction
data
and
put
it
into
a
database
so
taking
care
of
like
database
ml
transaction
with
rise
pop
up
all
of
that
stuff.
So
the
mechanic
of
putting
the
data
from
from
a
set
of
flat
files
into
a
database
and
then
the
last
step
is
that
performing
validation,
which
is
very
important
and
so
validation
includes,
like
validation
of
scalar
attributes.
F
Like
you
know,
a
percentage
can
only
be
a
positive
number,
for
instance,
then
referential
integrity,
so
making
sure
that
all
references
are
correct.
So
if
I
have
a
measurement
that
targets
a
specific
sample
with
the
temple
bar
code,
one
two
three,
then
I
need
in
my
sample
files.
I
need
to
have
decided.
I
need
to
have
defined
sample
one
two
three
for
this
reference
to
be
valid
and
also
stuff.
That
has
to
do
with
attribute
composition,
so
we
cut
which
combinations
of
attributes
are
valid
for
for
specific
entities.
F
So
this
step
is
really
important,
because
the
data
that
one
the
process
of
putting
the
data
into
the
database
really
does
a
ton
of
qc
and
standardization
of
the
data
which,
which
is
very
important
and,
for
instance,
as
part
of
this.
As
part
of
this
import
process,
we
standardize
a
lot
of
the
data
using
existing
ontologies.
F
So
you
know
there
are
ontologies
that
describe
the
name
of
different
genes,
the
name
of
different
proteins,
the
name
of
different
drugs,
etc,
etc,
and,
as
part
of
this
import
and
validation
process,
we
make
sure
that
everything
that
needs
to
be
validated
onto
ontology
has
been
actually
validated
on
the
relevant
quality
and
mapped
to
the
right
ontology.
F
I'm going to maybe skip this, because I don't want to go over time, and it's not really that important. Besides this system, we also built a separate system to do a branch-and-merge workflow for data, because, as I said before, we want users to be able to import their own data sets — but at the same time, we don't want users to just start dumping stuff into the production database.
F
And
so
we
built
a
system
whereby
user
can
request
a
copy
of
the
master
database
in
order
to
work
on
the
import
of
the
specific
data
set.
So
on
this
branch
data
database
they're
free
to
mess
it
up
or
it
doesn't
matter
if
they
import
broken
data.
Or
you
know
in
the
process
of
iterating
over
this
dataset
they're
going
for
program
data
multiple
times.
It
doesn't
really
matter
because
it's
happening
in
a
copy
of
the
database
and
then
there's
a
system
whereby
we
can
put.
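The branch-and-merge idea can be pictured as copying the master fact set, letting the import iterate destructively on the copy, and only folding it back once it validates. A toy model, not Candelabra's actual mechanics:

```python
import copy

master = {"facts": [("subj-1", "subject/disease", "melanoma")]}

def branch(db):
    """Give the importer a private copy of the master database."""
    return copy.deepcopy(db)

def merge(db, br, valid):
    """Fold a branch back into master only if its import validated."""
    if not valid:
        return db   # broken imports never touch production
    db["facts"].extend(f for f in br["facts"] if f not in db["facts"])
    return db

b = branch(master)
# Safe to mess up here: only the copy is touched.
b["facts"].append(("subj-2", "subject/disease", "lung cancer"))
master = merge(master, b, valid=True)
```

The deduplicating `merge` also means that re-importing the same rows several times while iterating does no harm, which is the point of branching in the first place.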
F
So we like to think of this as, as I said, a branch-and-merge workflow for data. So then, the data is all in there — and now, we're building all of this for data scientists to use, so we have to meet them where they live. And the data scientists, at least in our team — and, I would say, in the majority of computational biology — live in R.
F
So
we
accept
queries
obviously
over
over
over
the
wire
and
the
query.
Sources
can
be
either
our
library,
which
is
what
what
our
user
use.
So
our
user
use
issue
query
through
on
our
library
or
also
a
visual
query,
building
environment
that
I
think
has
been
it's
called
enflame
and
I
think,
has
been
presented
by
mike
travis
from
our
team
in
a
previous
meetup
and
then
the
the
the
data
is
really.
F
These
queries
are
serialized
to
json,
and
so
we
have
a
data.json
parser
on
the
other
end
that
accept
the
query
and
transform
it
from
json
into
something
that
the
atomic
can
understand
and
then
also
a
you
know,
a
system
to
automatically
improve
queries.
I'm
not
I'm
not
really
going
to
talk
about
these
two
aspects
today,
because
first
of
all
we'll
talk
about
it
before,
but
also
I
want
to
focus
on
on
the
r
functionality,
which
is
more
important
for
the
data
science
part.
F
So a Clojure query will look this way in R. As you can see, it's basically a very simple substitution of certain syntactic elements, but it basically looks the same way. This slide shows the transformation that we had to do going from the Clojure syntax to the R syntax, but it's really a one-to-one mapping, and I'm going to explain in a second why we had to do it.
F
So
basically,
you
can
write
a
query
like
this
in
r
looking
very
similar
to
closure
and
then
what
you
get
in
r
is
a
native
r
object
that
you
can
use
with
all
the
functions
that
exist
in
r,
so
you
want
to
use
r
for
plotting.
You
use
you
issue
a
query
like
this.
You
get
back
in
your
our
session.
You
get
back
a
native
r
object
and
then
you
use
it
for
downstream
plotting,
as
you
would
with
any
other
native
r
object.
F
But
the
other
interesting
thing
is
that
data
logging
queries
in
our
data
exactly
the
same
way
that
their
enclosure.
This
really
enables
composition
and
programmatic
programmatic
composition
of
queries
right.
So
this
is
this
is
something
that
allows
us
to
to.
You
know
as
developers
to
really
build
our
queries
that
are
that
are
built
programmably,
so,
for
instance,
to
give
an
example.
Here
we
have,
we
have
a
situation
where
we
have.
F
We
have
an
existing
query
which
we're
calling
q
here
and-
and
we
want
to
add
a
bunch
of
different
clauses
to
this
query
based
on
a
given
parameter
in
input
right,
and
so
we
have.
You
have
this
function
here
c
query
that
allows
to
take
an
existing
query
and
add
additional
clauses
to
it,
and
these
clauses
are
selected
based
on
some
other
logic,
and
so
this
is
very
useful.
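Because a query is a plain data structure rather than a string, adding clauses is just ordinary list manipulation. A sketch of that idea in Python — the query shape loosely mimics Datalog's find/where structure, and the clause contents are made up:

```python
def add_clauses(query, clauses):
    """Return a new query with extra where-clauses appended."""
    return {**query, "where": query["where"] + clauses}

q = {"find": ["?subject"],
     "where": [["?subject", "subject/disease", "head and neck cancer"]]}

# Clauses are chosen by ordinary program logic, then appended as data --
# no string interpolation anywhere.
responders_only = True
extra = [["?subject", "subject/response", "responder"]] if responders_only else []
q2 = add_clauses(q, extra)
```

Note that `add_clauses` returns a new query and leaves the original untouched, so partially built queries can be shared and specialized freely.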
F
You
know
it's
very
useful,
because
a
very
useful
consequence
of
the
fact
that
our
queries
are
data
structure,
the
same
way
that
they
are
in
closure,
it
would
be
something
would
be
much
more
complicated
to
do
if
queries
were
instead
strings
and
you
have
to
do
a
ton
of
string
interpolation
right.
We
can
instead
manipulate
queries
as
data,
so
this
dsl
really
takes
advantage
of
earth
lisp
origin.
So
I
don't
know
how
familiar
you
guys
are.
F
with R, that is. But R was really a Lisp initially, and the DSL takes advantage of the fact that symbols and expressions can be captured before evaluation and manipulated. One limitation is that you're still constrained by the fact that the expressions need to be valid R syntax, and that's the reason why we have to do some transformation of the syntax from Clojure to R: you still need to end up with valid R syntax.
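The "queries as data" idea being described can be sketched outside of R as well. Here is a minimal Python sketch of the pattern; the query shape and the helper name are illustrative, not CANDEL's actual API:

```python
# Sketch of "queries as data": a Datalog-style query held as a plain
# data structure and extended programmatically, instead of assembling
# strings. Keys and clause shapes here are illustrative only.

def add_clauses(query, clauses):
    """Return a new query with extra where-clauses appended."""
    extended = dict(query)
    extended["where"] = query["where"] + clauses
    return extended

base_query = {
    "find": ["?sample"],
    "where": [["?sample", "sample/subject", "?subject"]],
}

# Clauses chosen by some other logic, e.g. a user-supplied filter.
timepoint_filter = [["?sample", "sample/timepoint", "baseline"]]

extended_query = add_clauses(base_query, timepoint_filter)
# base_query is untouched; extended_query carries both clauses.
```

Because the query is an ordinary data structure, no string interpolation or escaping is involved, which is exactly the property the speaker highlights.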
F
The pull syntax that you have in Datomic is also available. The same way you can do a pull query in Datomic, you can do a pull query with our R library, with this syntax transformation.
F
We have this pret package that, as I was saying, is the programmable ETL pipeline written in Clojure, and an additional system called candelabra that contains all the machinery for doing the branch-and-merge workflow, as I said before. So this is the way the data gets into the system. And then, for getting data out of the system,
F
They
just
say
you
know,
give
me
all
the
g.
We
have
a
function,
a
pre-baked
function,
a
prepaid
query
in
our
library.
That
says
give
me
all
the
gene
expression
measurements
for
this
data
set.
So
the
user
actually
just
called
that,
and
so
it's
exposed
to
them
as
an
r
library.
They
get
the
data
back
and
then
they
do
whatever
analysis
they
want,
and
on
top
of
that,
we've
also
built
a
lot
of
we're,
starting
to
be
a
lot
of
tools
for
doing
interactive
data
exploration
and
visualization
right.
F
So
we
you
can
build
dashboards
and
we
just
that
talk
to
the
database
and
so
allow
somebody
that
doesn't
doesn't
know
r
but
knows
how
to
click
around
to
you
know,
generate
plots
and
visualization
from
the
same
system.
Using
the
same.
F
My last slide is acknowledgements. I really want to acknowledge my colleagues from our team and Ben Campos, who have been absolute partners in building the system up and absolutely essential contributors, as well as, in particular, George Kirsten from the Cognitect team, who has also been a huge help in implementing this. And with that, I'm only six minutes over time, so I'll take questions. Thank you for listening.
A
I think some of the questions kind of answered themselves, but Jordan is asking whether it was straightforward for the users living and working in R to learn this Datomic DSL.
F
If you're talking about the query language: yes, it wasn't too complicated. But I must say that the reality is that we've already built so much functionality that our users very rarely go down to the level of writing custom queries. A lot of the time, what I'm showing you here is
F
What
is
like
the
nuts
and
bolts
of
how
this
thing
works,
but
the
reality
is
that
most
of
the
time,
users
just
just
use
the
functionality
they
were
built
on
top
already,
but
yeah
I
mean
have
we
had
users
picking
this
up,
I
would
say
that
the
the
the
the
probably
what
what
what
users
have
had
to
more
learning
to
do
was
was
was
learning
how
to
use
spread.
But
you
know,
I
think
this
is
still
it's
still.
It's
still
less
than
I
would
have
have
to
learn.
F
If
we
had
to
learn
how
to
you,
do
closure
and
atomic
to
do
transactions
and
and
everything,
so
you
know
one.
One
thing
that
I
like
about
this
system
is
that
we
have
had,
and
we've
literally
had
high
schoolers
join
our
team
and
in
a
couple
of
weeks
as
inter
during
the
summer
in
a
couple
of
weeks,
they
were
able
to
use
the
system
to
import
data,
because
the
only
thing
that
you
really
need
to
understand
here
is
the
data
and
maps
to
the
schema.
You
don't
need
to
understand
the
atomics
internal.
A
Well,
yeah,
that's
yeah,
that's
really
cool!
To
hear
I
mean
if
you
can
get
to
that
level
of
quick
activity
and
productivity,
even
with
high
school
kids,
and
really
speaks
to
both
the
quality
of
the
atomic
and
then
the
the
interface
with
r.
This
is
really
interesting.
F
Yeah,
so
yes,
I
can.
I
can
certainly
explain
that
so
so
you
know,
if
you,
if
you
do
a
little
bit
of
google
searching,
you
will
see,
you
will
see
people
decrying
the
crisis
of
reproducibility
in
science
right
so
in
general,
in
science.
There
is
this
this
this
problem
in
biological
sciences,
especially
of
getting
reproducible
results
so
being
able
to
you
know
I
do
an
analysis,
then
I
give
you.
F
But
what
we
wanted
here
was
to
at
least
try
the
stuff,
that's
under
control,
so
the
stuff
that's
solvable
in
software
to
solve
it
as
much
as
possible
right
and
so
obviously,
when
you
have
a
piece
of
the
analysis
in
order
to
have
a
reproducible
result,
you
need
to
have
you
know
the
code
need
to
be
version
obviously,
but
we
have
kit
for
that.
So
that's
that's
very
easy,
then,
to
reprodu
the
environment
in
which
the
code
is
run
needs
to
be
version,
and
you
can
use
docker
for
that.
F
So that's where our system comes in. The reason why you want to do this is that this data is exceedingly complicated and errors are made all the time in the analysis, so it's very important, both for auditing and for being able to answer questions from other people in the field, that you are always able to at least get the same result.
F
Maybe
it
was
the
wrong
result
in
the
first
place,
but
at
least
at
least
you
can
you
can
get
you
can
get
the
same
result.
So
you
know
it
happens
all
the
time
that
you
know.
Maybe
somebody
reads
your
paper
and
that
will
see
a
result
and
and
won't
be
able
to
to
you
know,
will
download
your
data.
F
I
won't
be
able
to
reproduce
the
same
thing
right,
so
we
want
to
at
least
be
in
the
position
where
we
can
guarantee
that
whatever
went
into
the
publication,
whatever
was
in
the
figure,
we
can
reproduce
it
exactly
now.
Obviously
this
doesn't
guarantee
that
that
result
is
correct
because
there
could
be
still
bugs
in
the
code
or
whatever
the
analysis
code
could
be
incorrect,
but
having
it
be,
reproducible
is
the
first
step,
at
least
to
you,
know,
to
being
able
to
start
from
a
common
understanding
of
where
things
are.
A
Cool
yeah.
Thank
you
very
much.
There
was
also
a
question
by
jesus
regarding
this
branch
and
merge
system,
because
every
time
you
have
this
type
of
system,
you
can
have
conflicts.
When
you
try
to
merge
everything
together.
Could
you
comment
on
that.
F
Yeah,
so
the
the
reason
why
we
built
this
system
was
really
the
fact
that
when
you
do
this,
when
you
do
it,
when
you
spread
to
do
this
import,
obviously
you
know
you
import
the
data
the
first
time,
and
you
realize
that
you
know
you
made
a
mistake
on
your
configuration
file
or
the
data
is
wrong
in
some
way.
So
you
need
to
fix
it
before
you
import
it
right.
So
the
idea
it
was
it
was
impossible.
F
That doesn't mean that there isn't still, and I put it here in this diagram, an admin approval process. When there is a new data set coming in, an admin is essentially looking at it and making sure that the user has done the right thing, and then it gets merged.
F
So
this
is
actually
a
very
complicated
project
that
we
just
recently
finished,
and
in
order
for
this
to
be
possible,
we
really
had
to
to
make
sure
that
you
know.
There's
lots
that
goes
into
making
this
possible,
but
like
one
important
thing,
is
that
every
single
entity
in
the
database
needs
to
have
a
unique
domain
based
identifier.
So
not
not
just
a
db
identified,
but
we
have
domain
based
identifier,
so
that
you
can
able
to
say.
F
The
new
version
of
the
data
set
right.
So
actually
this
this
differ.
This
d
thing
of
existing
data
is
a
much
more
complicated
problem
than
just
the
idea
of
importing
a
whole
new
data
set
in
production
which
is
really
about.
You
know
an
admin
approving
and
then
triggering
a
series
of
processes
in
the
in
in
the
cloud
that
basically
take
all
the
data
importing
the
production.
This
is
semantic
is
much
more.
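As a loose illustration of why stable, domain-based identifiers matter for diffing: entities can be matched across dataset versions by a domain key rather than a database-internal ID. This is a hypothetical sketch, not CANDEL's actual diff machinery; the `sample/...` keys are made up:

```python
# Sketch: diffing two versions of a dataset keyed by stable,
# domain-based identifiers. With only DB-internal IDs, the same
# real-world entity could not be matched across imports.

def diff_by_domain_id(old, new):
    """Compare two {domain_id: attributes} maps; classify entities."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys()
               if old[k] != new[k]}
    return added, removed, changed

v1 = {"sample/S1": {"timepoint": "baseline"},
      "sample/S2": {"timepoint": "week-4"}}
v2 = {"sample/S1": {"timepoint": "baseline"},
      "sample/S2": {"timepoint": "week-8"},
      "sample/S3": {"timepoint": "baseline"}}

added, removed, changed = diff_by_domain_id(v1, v2)
```

Here `sample/S2` shows up as changed and `sample/S3` as added, which is the kind of per-entity decision an updated import has to make before an admin approves the merge.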
A
Very interesting. There is also a question, I guess, that both you and Sivaram can talk about. There was some discussion regarding the use of ontologies: we saw initially how Sivaram explained them, and it seems that in CANDEL the ontologies end up being more like a graph, or flatter, and there was some discussion that there are challenges in using hierarchical ontologies. I think this discussion is interesting.
F
We very much strive to use flat vocabularies, and the main reason for this is that, for our specific use case, this process of standardization, of mapping to ontologies, has to be done by data scientists who don't have the patience or the willingness to learn all the intricacies of these ontologies. For instance, I'll make a very simple example in terms of disease type. Obviously we want the disease type to be a standardized controlled vocabulary, so lung cancer should always be lung cancer.
F
Okay,
so
that's
that's
the
control
vocabulary
aspect
of
it,
but
at
the
same
time,
some
of
the
ontologies
that
are
using
the
biomedical
field
that
really
elaborate
in
terms
of
placing
this
lung
cancer
concept
in
a
very
complicated
tree,
as
civilaram
was
was,
was
explained
in
a
very
complicated
trees
and
taxonomy
of
classification
of
cancer.
F
We
decided
not
to
use
that
and
to
prefer
another
ontology,
that's
a
bit
rough,
more
rough
and
less
sophisticated,
but
that's
the
properties
of
being
essentially
a
flat
control
vocabulary,
because
we
really
you
know
in
order
to
sometimes
in
order
to
be
able
to
do
this
map
precisely
you
need
to
really
be
a
specialist
in
these
ontologies
and
and
our
scientists.
Our
data
scientists
are
definitely
not
special,
that
they're
not
gonna,
I'm
not
gonna,
be
kinda
anytime
soon,.
B
Yeah,
so
these
I
I
agree
with
what
freddy
korres
said.
You
know
using
the
hierarchy.
B
The
taxonomy
aspect
of
the
ontology
is
a
fairly
advanced
use
of
the
ontology
for
the
for
the
most
part,
the
most
common
use
is
to
flatten
the
the
the
structure
of
the
ontology
and
use
it
as
as
lists
for
multiple
ways
you
know
which
so
which
moves
more
towards
what
you
are
saying:
critical
as
control
vocabularies,
and
but
it
it
doesn't
mean
that
you
it's
a
wrong
way.
It's
an
incorrect
way
of
using
an
ontology.
B
It
is
one
way
in
which
an
ontology
can
be
used
right,
the
same
the
same
labels
and
the
same
words
and
terms
that
you're
using
that
you're
getting
from
the
ontology
once
you
have
mapped
them
into
your
data,
and
you
from
you
can
still
make
use
of
the
structure.
The
ontology
structure,
the
the
both
the
hierarchy,
as
well
as
the
other
relationships
that
actually
make
it
turn
it
into
a
rich
web
of.
You
know
a
rich
graph.
B
That's
you
can
use
that
for
querying
and
for
reasoning
and
for
for
for
doing
you
know
a
lot
of
for
for
generating
new
information
that
was
previously
not
or
or
not
asserted
in
the
data
itself
right.
So
a
simple,
a
simple
example
of
something
like
that
would
be.
You
know
in
your
data
set
if
you
had,
if
you
had
a
cancer
that
was
labeled
as
oral
cancer
right
and
you
have
other
cancers
which
are
labeled
as
a
nasal
cancer
or
cancer
of
the
error
and
another
one
like
ear
cancer.
B
But
if
you
are
going
to
now
start
querying
hey,
I
want
to
find
out
all
patients
with
head
and
neck
cancers
right.
The
the
word
head
and
neck
doesn't
doesn't
appear
in
any
one
of
these.
You
know
data
points
that
you've
just
stored
right,
one
says
oral
cancer,
another
says
nasal
cancer.
Another
says
year.
B
So
this
is
where
the
power
of
an
ontology
comes
in,
where,
even
though
your
data
is
at
the
level
of
the
more
granular
terms,
you
can
pull
back
up
and
do
a
subsumption
query
where
you
can
query
for
all
the
subclasses
of
what
a
head
and
neck
cancer
is,
and
that's
where
you
take
the
taxonomy
the
hierarchy
structure
and
do
that
query
and
and
bring
back
all
the
results
that
match
oral
cancer
and
all
these,
because
these
are
all
these
all
of
them
fall
under
head
and
neck
cancer.
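A subsumption query of the kind described can be sketched in a few lines of Python. The taxonomy below is a toy illustration, not a real ontology:

```python
# Sketch of a subsumption query over a tiny is-a hierarchy:
# find every record whose term falls under a given ancestor class.

TAXONOMY = {  # child -> parent ("is-a")
    "oral cancer": "head and neck cancer",
    "nasal cancer": "head and neck cancer",
    "ear cancer": "head and neck cancer",
    "head and neck cancer": "cancer",
}

def subsumed_by(term, ancestor):
    """True if `term` is `ancestor` or a transitive subclass of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = TAXONOMY.get(term)  # climb one is-a link
    return False

records = ["oral cancer", "nasal cancer", "lung cancer", "ear cancer"]
matches = [r for r in records if subsumed_by(r, "head and neck cancer")]
# matches: the oral, nasal, and ear cancers; "head and neck" never
# appears in the records themselves, only in the hierarchy.
```

Real ontology reasoners handle multiple parents, equivalences, and much larger graphs, but the core "walk up the hierarchy" idea is the same.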
B
That's one way of doing it. It sounds very simple, but I think, as Federico probably knows, it's not the simplest of things to do.
B
Yeah
and-
and
this
is
something
that
you
can
do
in
a
triple
store
like
virtuoso
or
blaze,
graph
or
elegrograph,
which
are
which
are
pure
rdf,
triple
straws
semantic
ventricular
stores.
The
atomic
does
not
support
this
kind
of
reasoning
to
one
of
the
areas
that
I
have
worked
with
in
the
previously
is:
how
do
you
represent
this
kind
of
ontological
structure
into
an
atomic
schema
and
be
able
to
develop
custom
algorithms?
B
So
that
you
can
do
this
kind
of
reasoning
in
india
comic,
because
if
you
look
at,
if
you
I
think
I
didn't
mention
it
and
I
think
frederick
also
didn't
mention
it.
The
sparkle
language
is
actually
very,
very
close
to
data
log.
The
and
the
rdf
triple
structure.
Triple
structure
is
very
similar
to
the
eav
data
structure
in
in
the
atomic
and
and
when
I
met
rich
hickey.
B
I
asked
him
about
this
because
I've
been
I'm
coming
from
a
semantic
web
background
and
when
I
first
encountered
clojure
and
then
comic
and
met
rich
hickey
at
one
of
the
closure
conferences.
I
asked
him
about
it
and
he
said
yes,
closure
and
atomic
borrows
a
lot
from
the
semantic
web
space,
and
so
you
can
see
a
lot
of
that.
The
thought
process
that
went
into
developing
atomic
you
know.
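The parallel drawn here between RDF triples and Datomic's EAV facts can be made concrete with a toy sketch; all identifiers below are made up for illustration:

```python
# Sketch of the analogy: an RDF triple is (subject, predicate, object);
# a Datomic-style datom is (entity, attribute, value) plus a
# transaction. Both are flat facts, which is why Datalog-style
# pattern matching applies naturally to both. Toy identifiers only.

rdf_triple = ("ex:patient-42", "ex:hasDiagnosis", "ex:OralCancer")
eav_datom = (42, ":patient/diagnosis", "oral-cancer", 1001)

subject, predicate, obj = rdf_triple
entity, attribute, value, tx = eav_datom

# A trivial "pattern match" over a fact base, in the spirit of both
# SPARQL and Datalog: find all values for a given attribute.
facts = [eav_datom, (42, ":patient/name", "A. Smith", 1001)]
diagnoses = [v for e, a, v, _ in facts if a == ":patient/diagnosis"]
```

The extra transaction component is what gives Datomic its time dimension, one of the differences from a plain RDF store.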
C
I think this was in reaction to Ben's comment that he could sort of infer the relationships from the facts, and I was a little skeptical. I suppose if the set of facts is large enough, or diverse enough, you could do that, to cover the semantic field that you're trying to extract.
C
But if you had, say, a very narrow set of relationships, I'm not sure whether that would be sufficient to really get back to a real ontology.
B
So
you
know
in
a
very
small
space
if
it's
a
small
problem
that
you're
dealing
with,
I
would
definitely
say
that
you
know
if
you're,
using
an
ontology.
That
would
be
an
overkill.
It's
like
saying,
you
know
if
you
want
to
put
a
if
you
want
to.
If
you
want
to
dig
a
small
hole
to
to
plant
a
small
planter
right,
you
don't
want
to
get
a
big
tent
on
truck
digger
to
dig
that
hole
right,
so
you
have
to
use
the
right
tools
for
the
right
purposes.
B
So if you have a very small, well-defined space, then writing your own relations, how you want to find relations within them or what you want to infer from them, is probably going to be just as easy as, or maybe easier than, trying to use an ontology, or for that matter something like Datomic. It's a non-trivial process.
C
So
the
trade-offs
on
your
on
your
controlled
vocabularies
are
what
perhaps
you
could
speak
a
little
more,
I'm
assuming
something
like
you
can't
easily
identify
hierarchies
hierarchical
things
like
classes
and
super
classes
as
easily.
If
you
just
have
a
a
list
of
controlled
items,
yes,
I
mean
what
did
we
talk
about?
I
didn't
quite
follow
that.
Well,
so
you
said
you're
using
controlled
vocabularies,
which
are
really
just
lists.
C
D
B
Yeah
yeah,
so
the
the
thing
about
control
vocabulary
is
is:
is
that
you
can
you
can
pretty
much
put
anything
in
there
right?
It's
not
it's,
not
a
pure
taxonomic
structure.
Now
what
what
does?
What
does
the
hierarchy
mean
in
terms
of
taxonomic
structure
or
in
term
of
terms
of
an
ontology
right?
So
when
you
say
a
is
a
subclass
of
b?
What
does
what
does
that
mean?
So,
for
example,
you
say
a
dog
is
a
subclass
of
animal.
The
the
class
dog
right
is
a
subclass
of
the
class
animal.
B
That
what
it,
what
it
means
is
that
every
instance
of
a
dog
is
an
instance
of
an
animal,
and
there
is
no
instance
of
a
dog
that
is
not
an
instance
of
there's
no
instance
of
dog.
That
is
not
an
instance
of
animal.
It
has
got
that
that
kind
of
connotation
to
it,
formalism
associated
with
it.
So
when
when,
when
you
descri,
when
you
create,
when
you
create
an
ontology,
when
you
create
a
hierarchy
in
an
ontology,
this
is
the
kind
of
discipline
that
you
are
following.
B
But
when
you
look
at
control,
vocabularies
and
I'll,
give
you
an
example
from
icd-9
codes.
Right.
Icd-9
code
is
the
diagnostic
codes
that
are
used
in
medicine
and,
if
you
look
at,
if
you
go
to
icd-9
and
hypertension
area,
right,
you'll
see
hypertension,
which
is
a
disease,
but
you'll
also
see
something
like
hypertension
with
heart
disease,
hypertension
with
a
chronic
kidney
disease.
B
Okay.
So
if
you
think
of
this,
as
from
a
hierarchical
point
of
view,
you
are
seeing
you're
saying
that
hypertension
with
kidney
disease,
so
there
are
two
diseases
there.
Hypertension
and
kidney
disease
is
a
hypertension
and
that
doesn't
make
any
sense.
A
Kidney disease is not hypertension, yeah.
C
Why
I'm
sorry
you're
going
across
two
axises
right?
I
mean
you're,
saying
it's
a
it's
a
certain
disease,
but
it
also
is
related
to
another
disease.
What
does
that
sort
of
imply
that
you,
you
would
want
to
move
to
a
more
general
graph
structure
where
you
could
represent
more
relationships
than
simple
yeah.
B
Yeah
so
so
is
a
the
is.
A
relationship
provides
the
hierarchical
structure
right
and
along
with
that,
if
you,
if
you,
if
you
were
to
the
example
that
I
gave
right,
it
was
about
from
pain,
pain,
says,
chest
pain
is
a
type
of
pain
and
then
it
has
sight
the
chest
so
that
provides
the
other
kinds
of
relationships
with
goes
towards
enriching
this
graph
and
that's
what
you're
missing
out
when
you
use
just
a
control
vocabulary,
you
don't
have
any
of
those
things.
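A toy sketch of this point: an ontology graph carries relations beyond is-a (such as a has-site relation), which a flat term list cannot answer queries about. The terms and relation names below are illustrative:

```python
# Sketch: an ontology as a graph of labeled edges. Besides "is-a",
# other relations (here "has-site") enrich the graph, so queries like
# "where is chest pain located?" become possible. A flat controlled
# vocabulary is just the list of term strings and cannot answer this.

EDGES = [
    ("chest pain", "is-a", "pain"),
    ("chest pain", "has-site", "chest"),
    ("abdominal pain", "is-a", "pain"),
    ("abdominal pain", "has-site", "abdomen"),
]

def objects(subject, relation):
    """All objects reachable from `subject` via `relation`."""
    return [o for s, r, o in EDGES if s == subject and r == relation]

site = objects("chest pain", "has-site")
# the is-a edges still give the taxonomy; has-site gives the extra
# relationship a flat vocabulary would lose.
```

This is the "rich web" of relationships mentioned above, reduced to its simplest possible form.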
A
All right, thank you very much to both of you. Since we still have some time until the end of the meeting, and we gathered a list of topics and some questions during the RSVP process, I would like to just open the floor and see whether someone has any further questions. They may be data-science specific; someone asked, for example, about the state of data science tech in Clojure.
H
There is a pending question for Sivaram: in his architecture, how does he use Clojure?
B
Okay, so I don't use Clojure for most of my work. I use Clojure more on the personal side, for most of my own research projects, my own skunkworks kind of thing, and that's where I've used it all the way from building data science pipelines to even web applications, as well as for some NLP and, more recently, struggling with some machine learning areas. On the side of semantic web and ontologies,
B
I have done some work around using ontologies and ontology schemas and representing them in Datomic, and in Crux. I don't know if you've heard of Crux; Crux is another database similar to Datomic, so that's another one that I have used, to represent ontologies as a schema and to be able to leverage the hierarchies and other relationships within them.
B
With NLP, I have used OpenNLP for the main part; it seems to have pretty decent Clojure support. Other ones that I have come across are Stanford NLP and the like, but I haven't used those. I know that there are other ones where you can come across examples on the net, but I've really struggled to make them work, both for NLP as well as for machine learning.
F
In our team, we're all using R. But one could imagine, obviously, the same way that we built a system to use R, building a system to use Python. Even though I'm not as familiar with Python, I don't think the same level of symbolic manipulation is possible there, so the syntax might be a little bit more string-based. But no, in our case we all use R.
G
Yeah, hi, I'm Lacey. I also work at the Parker Institute and work on CANDEL; I've been chatting with some of you. I'll also say that a large reason we use R for all of our data science and analytics is the extensive library of computational biology tools that exist in R. Anytime a new paper comes out with a new method in computational biology,
G
It's
always
written
up
first,
as
an
r
package,
so
being
an
r
for
us
is
pretty
critical
to
staying
able
to
use
the
most
current
methods
in
the
field.
So
that's
why
we're
in
r.
F
Yeah, I mean, they end up published in scientific publications, in journals, the same way that they would be if you weren't using CANDEL. To all intents and purposes, this is an infrastructure detail that will maybe be mentioned in the publication,
F
Maybe,
but
it's
not
really,
you
know
most
of
the
time
it's
not
really
relevant
to
to
to
this
kind
of
paper
that
really
focus
on
the
results.
H
That's like the same question. And the other question is: I'm curious about your use case, Sivaram, and the two timelines that they provide.
H
Yes,
so
krogs
has
like
a
two
axis
for
time.
Okay,
right!
That's
what
I
understand
this
is
the
atomic
is
time
based
cross
is
time
based,
but
they
provide
like
a
second
second
line
of
time,
and
I
was
wondering
how
do
you?
How
would
somebody
take
advantage
of
that
and
if,
for
some
reason,
you're
using
that
or
you're
using
it
as
an
atomic.
B
I'm
using
it
more
closer
to
what
the
atomic
functionality
is
providing.
I
haven't
really
explored
a
lot
about
the
the
time.
Travel
aspect
that
crux
is
is
you
know,
is
featuring
yeah
yeah,
I'm.
H
Playing
a
bit
with
crocs,
but
really
in
the
same
place
right
just
using
it.
I
H
F
You know, in our field there is such a mountain of analytical routines already written in R that the idea of providing them in Clojure is, for us, a non-starter, really, because R is the standard in the field. So the approach that we have taken really has to do with using Clojure for all the data processing, and a lot of it is done in an abstract way. For instance, the tools that we built are all schema-agnostic: pret doesn't know anything about the Datomic schema.
F
The
atomic
schema
is
read
as
input,
so
a
lot
of
a
lot
of
these
computation
happens
at
a
pretty
high
high
level
from
a
from
a
semantic
point
of
view,
because
it
doesn't
actually
know
the
semantics
of
the
schema
interprets
the
semantics
of
the
schema,
and
so
you
know
all
for
all
of
this
stuff
closure
has
been
fantastic.
I
don't
even
know
how
we
would
have
written
this
in
in
any
other
language,
but
you
know,
and
then
so,
the
the
the
the
the
way
we're
using
did
is,
though,
is
really
confined
to
a
specific.
F
You
know,
problem
domain,
for
which
we
think
it's
really
well
suited,
and
then
the
the
you
know
the
then
we're
using
you
know
r
for
for
all
the
analysis,
because
yeah
once
again,
we
we
wouldn't
be
our
users
are
all
in
our
we're,
not
gonna
they're,
not
gonna,
learn
to
use
closure.
I
mean
that's,
that's
just
that's
just
the
reality.
F
But
I
will
also
say
that
so
what
one
thing
is:
that's
really
interesting.
So
what
we're
doing
is
not
it's
different
from
from
what
daniel
was
mentioned,
so
we're
not
trying
to
either
you
use
like
embed
our
enclosure
or
bad
closure
into
our
obviously
right.
So
the
system
is,
is
closure,
enclosure
and
atomic
is
a
web
server
and
you
could
query
the
web
server.
However,
you
wanted,
we
happen
to
be
querying
in
our
because
we
use
r,
but
I
guess
what
did
what
daniel
was
talking
about
and
what
maybe
other
people
here
are
hinting
about.
F
A
Unfortunately, we are at the end of our time today. I hope everyone learned something new; at least I know I did. I definitely encourage everyone to keep the discussions going, either by contacting the speakers, joining the Clojurians Zulip channel, or getting in touch with me, Daniel, or anyone in the Scicloj community. We are always eager to have more people talking about data science and Clojure, and to see what you guys are doing out there.
H
Yeah, no, I was just mentioning, and I didn't want to take the meeting's time for that: I'm really happy, because I've been able to gather a group of people interested in learning Clojure. We created this community called Clojure Hispanic, and we just recently revived it; it existed last year, but in the last month we revived the group, and it is really, really active.
H
Now
we
are
doing
inspired
by
you
with
this
idea
of
study
groups.
We
use
propose
it
and
we
are
doing.
We
are
reading
books
now
about
closure,
and
basically
I'm
here
like
spying
and
taking
notes
about
how
to
manage
study
groups,
but
we
are
like
26
people
and
all
of
us
have
been
doing
this
all
the
weeks
doing
the
the
classes
and
that,
I
think,
is
a
good
signer.
H
Well,
that's
it
so
we
are
people
from
spain,
obviously
and
all
the
americas,
I'm
from
from
venezuela
and
in
canada,
but
I'm
from
venezuela.
We
have
people
from
nicaragua,
peru,
argentina
and
spain
right
now,
and
it's
good
to
find
people
speaking
my
language,
but
I
can
do
a
better
level
in
communications
for
sharing
closure,
knowledge
and
ideas.
H
Well, I manage a group called Python Venezuela. It has more content; we are more mature there, with the page and everything, and many more people than the other group. I'm a manager and founder, and we have a foundation in Venezuela for that. Half the people are from many places, not only Venezuela; I don't know why, but it's good. But I'm all in now on Clojure: I help with the group every Friday, where I do something called a consulting hour.
H
I
spent
one
hour
in
gtc,
one
video
conference
and
people
freely
come
in
and
ask
questions
right,
and
I
just
try
to
help
or
try
to
answer
the
questions
about
python.
When
I
get
some
or
that's
an
idea
right.
If
so,
if
I
get
some
expertise
in
closure,
I
expect
to
do
the
same
thing
for
closures
on
daily.
It's
one
one
hour
a
week.
I
just
put
the
link
in
the
group
and
no
no
scheduling
I'm
just
there
waiting
for
them.
Then
randomly
people
come
in
and
ask
me
something
about
item
and
we
try.
H
They are the best example, I think the right example, of a community in a local language. They did the whole translation into Spanish of all the Python documentation, and everything is in Spanish, and it is wonderful. When we created Python Venezuela, we tried to mimic Python Argentina, but we didn't need to do the translation; everything was there, so I put links to Python Argentina whenever people need to read documentation in Spanish.
H
Anything
like
that
enclosure
not
going
near
so
that
would
be
the
most
important
I
think
initiative
to
just
have
a
translated
content.
I
I
I
know
by
first
hand
how
different
it
is
to
reading
you
know
in
spanish
and
in
english.
It's
her
to
heaven
right
so
yeah.
That
would
be
a
good
goal
if
we
collect
enough
energy,
I
suppose.
A
Yeah, cool. Then again, thank you very much, everyone; this was awesome. For those that are starting the day, have a nice day; for those that are finishing it, have a nice night. Bye, everyone. Thank you. Bye-bye.