Description
Presented by Rogan Hamby & Jason Ethridge, Equinox
Slides: https://drive.google.com/file/d/1yIb9nDlvEIzLmMASaQdP0SxH2ax3JSsX/view?usp=sharing
Rogan: So just a wee bit of history before we get into the nitty gritty stuff. Mig is a tool set that borrows its feel from git. Now, by that we mean that it is a command line tool, and it is kind of a wrapper for other, more specialized tools. Jason will talk later about things like mig add, mig remove, stuff like that. And mig is simply an abbreviation for migration. It was originally developed as a migration tool set and lives inside a larger migration tools repository.
However, of course, migrations are ultimately just big data projects, right? So it's not terribly surprising that a tool set that's useful for migrations is also useful for other data projects, and we'll talk about that later. But we're going to focus on the original set of use cases that led to the creation of mig first, and that is migrations.
It is a collection of tools that are primarily written in Perl, although you'll find a number of bash scripts, XML files, SQL, and the occasional cat picture or easter egg in there as well: just whatever has been useful as a repeated-use utility in doing migration work. And these live on GitHub; you have the address right there. It is under the Equinox Open Library Initiative organization, migration-tools.git. We track issues there.
Jason: If you're going to use the MARC cleanup utility, then you need that. And if you have trouble (I have trouble installing Text::CSV::Auto sometimes), often I put those directly into the Equinox migration lib directory, and when I do that I need the Perl lib as well, yeah. The other thing about MARC cleanup is it's good for adding sequentially numbered tags on the fly, and you don't often need to do that with Koha migrations.
This is how data can be represented deep under the hood within Koha. So, a table: you can kind of think of a table as a tab in a spreadsheet, where you have, you know, different columns and fields, and each row in a table is kind of analogous to a row in a spreadsheet. And the migration process is to take data that's similar to this and beat it into shape so that it fits within these rows and columns.
Rogan: If you have a tool set for converting data from a certain XML or JSON, or even PDF or whatever, once you get it into a tabular format it'll work with mig. And I mentioned the larger migration tools repo a few minutes ago; there are a number of non-mig tools in there that are specific to certain data sources, as well as, of course, other repositories elsewhere that are sometimes specific to given ILSes or data sources for converting.
Jason: CSV is not always CSV; it's not a rigorous standard there, in my opinion, but we do have tools for dealing with it. So there's, you know, a cleanup step that can kind of fix a lot of CSV for you.
Rogan: Yeah, I think calling CSV a standard might be a stretch. A frequently abused gentleman's agreement might be the best description of what CSV really is.
Jason: It's kind of like SIP in that regard. And what you're seeing there on the screen is actually pipe-separated, pipe-delimited data. In some ways that's preferable to pure CSV, because you don't often see an actual pipe in the data itself, so you don't have to worry about those delimiters being escaped or quoted.
Rogan: So, ultimately, this is your goal: converting your data to get to a line-oriented file, or what I tend to call a tabular file, and then you're going to want to stage it. Like we said, and as I said before, this was first about the use cases of mig, why we came up with mig. So we're going to start with a non-optimal way to do this migration process.
So what's the most non-optimal way you can do it? The most non-optimal way is to do everything by hand: sit down and type out "CREATE TABLE user_data (" and, I'm not going to go through every line, but obviously list out every column of data and list out its data type.
A tab, whatever. And during this process you need to watch out for collation, because it's easy to end up with a table encoding that is not going to play nice when connected via joins to other tables in the system, so you've got to be very careful about that. And so this is the very manual, long, time-consuming way to do it, and I feel like we could probably have two or three slides of gotchas here. What do you think, Jason?
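For illustration, a minimal sketch of the kind of hand-typed staging DDL being described, assuming MariaDB and invented column names; pinning the character set and collation explicitly guards against the join gotcha just mentioned:

    -- Hypothetical hand-written staging table: every column typed out manually.
    -- The explicit charset/collation keeps joins against Koha's tables
    -- (typically utf8mb4 in recent versions) from misbehaving.
    CREATE TABLE user_data (
        user_id         VARCHAR(32),
        user_last_name  VARCHAR(100),
        user_first_name VARCHAR(100),
        birth_date      VARCHAR(20)    -- legacy dates often arrive as plain text
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;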
Rogan: Now, here on the slide we just have a couple of very simple manipulations. Card number maps directly to the user id, or rather user id to cardnumber; user last name maps directly to surname, and we're just trimming the spaces off the end. But some things are going to have much more complicated manipulations. Let's take names, for example: it's not uncommon to not have a last name in its own column.
It's actually pretty common to get data where you have something like smith, comma, space, jane, comma, mary, and you're going to have to take substring commands in SQL with positions and start splitting that stuff up, putting things into surname and first name, maybe other name if you're going to keep the middle name, or combining first and middle into first name, however you want to do it. And you may have to bring in data from other sources, some of which may be several tables of connection removed.
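As a sketch of that splitting, assuming MariaDB and hypothetical l_/x_ columns (those naming conventions are explained a little further on), SUBSTRING_INDEX can do the position arithmetic for you:

    -- Split a legacy 'smith, jane, mary' value into surname/first/middle.
    -- l_name, x_surname, x_firstname, x_middlename are invented column names.
    UPDATE m_borrowers
       SET x_surname    = TRIM(SUBSTRING_INDEX(l_name, ',', 1)),
           x_firstname  = TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(l_name, ',', 2), ',', -1)),
           x_middlename = TRIM(SUBSTRING_INDEX(l_name, ',', -1))
     WHERE l_name LIKE '%,%,%';   -- only rows with at least two commas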
So if you're going to go straight into borrowers, it's going to be a lot of very complicated manipulation. And the more complicated the migration is, of course, if you have to do it all in one single big transaction shot, the more chance there is of some sort of error sneaking in. Any more thoughts on that, Jason?
Jason: I mean, you could comment out that insert and just do the select and see if it blows up on you. But if you're dealing with a lot of data, you're not going to be able to tell at a glance necessarily, or browse through or skim that data and catch edge cases; whereas if you're doing something like this, you know, on the next slide, in a way where you're actually manipulating things within tables before pushing them into production, then it gets easier to catch. That's how it's done.
Rogan: Yeah, and I would argue that even if you have something that's super simple, it's probably a bad habit. Yeah, yeah! So let's talk about better staging tables a little bit. Jason already mentioned there's a next slide, and boom, like magic, here it is. When we talk about better staging tables, I'm going to start talking about quite a few conventions that we follow in our own workflows and that have kind of snuck their way into mig. Now, perhaps in a perfect universe tools are completely agnostic; they don't take on any of your workflow conventions.
So I want to make you aware of them as we talk about them, and why we have them, as well as that they exist. So, a better staging table: instead of manually creating a staging table, we start by creating a table like one that we want to inherit columns from. So, CREATE TABLE m_borrowers (m underscore is our convention that it's a migration-related table) LIKE borrowers. And what this will do, if you're not familiar with it, is basically create a second borrowers table with a new name.
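In MariaDB terms, that step is a one-liner:

    -- Clone the production borrowers structure into a migration staging table.
    -- Column definitions and indexes are copied; data is not, and the
    -- auto-increment counter starts over, which is the caveat discussed next.
    CREATE TABLE m_borrowers LIKE borrowers;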
Now, this is the part where Jason and I both feel a little bit of a pang of wishing that MySQL/MariaDB had child tables like Postgres does, because then you could actually share a number sequence. In MySQL and MariaDB, when you do this, it's not a child table; it's actually completely separate. It just has an identical structure, and the sequence is going to be completely separate.
So there are a few things we have to do in order to deal with that, and we'll talk about that in just a few minutes. Now, what is the use of having a copy of borrowers? Well, the advantage is that all that manipulation I said we don't want to do in a production table yet, we get to do in this m_borrowers instead, and then, when we're ready, just copy all that stuff over to borrowers.
The real value of this comes in when we look at the next couple of lines on the slide: ALTER TABLE, ADD COLUMN. We have two conventions we follow here. One is l_: l underscore means it's legacy data. We are bringing that data in from one of those tabular files that we talked about, and our convention is that we never alter it.
A
That
is
a
pristine
copy
of
the
data
as
we
pulled
it
in
so
that
if
it's
not
what
was
exported
from
the
system,
it's
at
least
what
was
in
the
files
that
were
imported,
and
this
allows
us
to
know
that
if
something
doesn't
match
up
somewhere,
we
need
to
look
for
maybe
an
issue
with
loading,
the
data,
but
one
way
or
another.
This
is
the
migrated
data
x
underscore
is
our
convention
for
something
that
we've
manipulated
or
calculated
in
some
way.
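A sketch of those ALTER TABLE lines under the l_/x_ conventions, with invented column names:

    -- l_* columns hold the legacy file's values verbatim and are never altered;
    -- x_* columns hold values calculated during the migration.
    ALTER TABLE m_borrowers
        ADD COLUMN l_patron_name    TEXT,                          -- pristine legacy value
        ADD COLUMN x_borrowernumber INT,                           -- resolved production id
        ADD COLUMN x_migrate        TINYINT(1) NOT NULL DEFAULT 1; -- should this row migrate?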
We immediately do an update where we take the x_borrowernumber and set it to the borrowernumber: we set the x_borrowernumber of the migration table to the actual borrowernumber in the real production table, based on the cardnumber at that moment. That means that we know from then on that x_borrowernumber represents the actual production row, regardless of what changes on that borrower's account from then on.
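A minimal sketch of that update, assuming the card number was loaded into a hypothetical l_cardnumber column:

    -- Pin each staged row to its production borrower, matched once by card
    -- number; from here on, x_borrowernumber identifies the production row.
    UPDATE m_borrowers m
      JOIN borrowers b ON b.cardnumber = m.l_cardnumber
       SET m.x_borrowernumber = b.borrowernumber;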
Jason: Surname: we'll use your last name. If you had to do more convoluted munging, that's where you would do it as well; you could put, you know, REPLACE and regexp replace and those sorts of things in here too. x_migrate: that's another convention we use during mapping. Libraries will often take the mapping process as an opportunity to clean up their data.
Rogan: Yeah, and I will say that when we talk about legacy columns, the l_ columns, they're usually text, just for convenience. Sometimes we might manually go in and change them to a varchar of a certain length if we want to index them or something, but the x_ columns are often data typed. So, for example, x_migrate is usually a TINYINT, because we only need a one or a zero, and others will often be of a data type that's convenient to move over into a production column.
Jason: There's another reason to do things like this. If you think back on the first bad example of inserting directly into production tables: MySQL has this bad habit of truncating data. And yes, you'll see warnings, and there are things you can do (you can SHOW WARNINGS after you do that), but if you've already pushed it into a production table, that's kind of too late, yeah.
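A quick self-contained demonstration of that hazard (behavior depends on sql_mode; in non-strict mode MariaDB keeps what fits and only warns):

    CREATE TABLE trunc_demo (v VARCHAR(10));
    INSERT INTO trunc_demo VALUES ('this string is too long');
    SHOW WARNINGS;  -- e.g. Warning 1265: Data truncated for column 'v' at row 1

Caught in a staging table, that is an annoyance; caught after the insert into production, it is lost data.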
But this works, I mean. So you have your staging table, you've done all your mapping, and you just insert into the production table the same columns in the same order, and you flag it where you're just doing the ones that are supposed to migrate, so possibly a subset of what's in the staging table. So we've talked about staging tables, but sometimes we do still want to insert directly into things, and those things can be staging tables too; you just don't necessarily have to munge or map them the same way.
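The final push being described might look like this sketch (the column list is abbreviated; a real load names every column explicitly, in matching order):

    -- Copy only the rows flagged for migration from staging into production.
    INSERT INTO borrowers (cardnumber, surname, firstname)
    SELECT m.l_cardnumber, m.x_surname, m.x_firstname
      FROM m_borrowers m
     WHERE m.x_migrate = 1;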
So this, you know, still gives you easy iteration, and you're not polluting, you know, a production table or anything. But sometimes you have derived tables or auxiliary tables where it's going to be pretty simple just to do things like this. In this example, this is a staging table that's eventually going to get pushed into the statistics table in Koha, and this one is specifically for circulations; but you might have another staging table that's, you know, intended for statistics but is for something else, related to fines or whatever.
And finally, other things you might want to stage. So we talked about hard-coded mapping, but, you know, for the migrations we do, we often give the libraries an opportunity to determine exactly how they want things to map. So there's a bit more data-driven mapping, or soft coding, going on here: we usually give them spreadsheets and then they kind of pick and choose how they want to map, especially with consortia.
Rogan: Now, all through this we've talked about bad ways to do things and good ways to do things, but we haven't really talked about how mig plays into it. So we're going to do that now, and as we talk about mig, what we're going to talk about goes back to that better way to do things: better ways to stage, better ways to map, better ways to load, and how mig allows us to take on a migration.
So let's meet the mig tools. I don't think this is all of them; we may be missing one or two here, but here are a bunch of the kmig tools. And you'll hear us talk about kmig: kmig is in contrast to emig. When mig started, it was just mig, and it was for Evergreen. As we've done more and more Koha migrations, and we do a pretty steady stream of them now, we discovered we really missed having these tools we used in Evergreen for Koha, and they weren't really Evergreen tools.
Actually, they were tools for Postgres. So we took those and we split them, so that one set of tools could cater to Evergreen's idiosyncratic needs and Postgres, while we could have the same functionality for Koha and MariaDB. And, as it's turned out, some tools have turned out a little bit different.
Jason: Yeah, but just to back up a little bit about how this is modeled after git: I was really enamored with git's subcommands and how you implement those subcommands, and it also gives you kind of a conceptual framework for doing migrations.
So once you learn this, you just get kind of the same steps every time; it kind of encodes or enshrines the workflow. And one of the first steps is to create the environment, which is what references what database you're dealing with and where the data is actually located on the file system. You can bundle all that up, and so if you're jumping back and forth between migrations, this helps you keep things straight without, you know, cross-contaminating anything.
It actually writes things into the database once you have it configured and know which one you're talking to: it creates a tracking table that gets used by the quick and add commands, and a few other things get added, like some convenience stored procedures that are useful for migrations, for the mapping step.
Mig link: this is actually what associates an incoming source file with an existing production table, and if the staging table doesn't already exist, it will do that whole CREATE TABLE m_borrowers LIKE borrowers type thing when you specify it here. So this makes those more useful staging tables for you, but it's optional; you don't have to associate a staging table with a production table. You want to take over on the other ones, right?
Rogan: Sure. Kmig status is very simple, and, let me say, backing up: we're going to go into more detail on upcoming slides about each of these; this is just to give you a quick overview for context.
Kmig status just tells us what status the environment is in, and it's going to give us that information. Quicksheet: quicksheet is extremely useful. It gives you a statistical overview of what's in a tabular data source. So you have these files and you want to get a quick overview of them: what kind of data is in each column, what kind of value ranges you have, things like that. It is extremely useful if you are working with a project manager or other people on your project who aren't technical and don't know tools like awk and grep.
So when we bring a library in, we will take an initial extract of their data and set them up on a test system after we've scripted the load. They're going to work on that a lot, and then, instead of making them manually recreate every little thing on production that they did on test, we'll often do an export of one or more options from this tool (we'll talk more about those options when we get there) and then just import them into the production system. So, easy peasy.
When we create the environment, it's going to create an env file under our home folder, in a little hidden folder called .kmig, and in that env file will be the information that you'll see in the environment. A lot of it's going to be pretty obvious kind of stuff: what is your MySQL database, what is your MySQL user, your password, your host, all that kind of stuff, by default.
You know, on a box where you have the koha-conf.xml file for it under the Koha sites directory and all that, it can pull a lot of this information automatically from that file and populate it for you. And we have a few conventions that are, again, defaults in kmig, such as our convention of using a migration-work git folder and a data folder. The data folder is where we tend to toss the raw data files, and migration-work is a shared git repository that's internal to us, for our scripts and things like that. But you can change these.
Rogan: Environment use is obvious: after you've created your environment, you want to actually use it, and these are going to be the system environment variables that it creates. Some of these are used by kmig directly (actually, all of them are used by kmig in one way or another), and you see down there a shell process id; that's because it's going to, of course, create a shell for you to do this in.
Jason: So with the mig environment, there's one thing that we do on the Evergreen side that we couldn't propagate on the Koha side. You're using Koha's instance names here, and that's where you do your migration; but on the Postgres side, you're actually going to specify a migration schema, and you could have more than one migration schema for the same migration if you want to partition things a bit more. So there may be some warts with this.
So this is spelling out the tracking column we use. Some other things are a bit more experimental but are still there: base staging tables, which mig link can replicate on the fly, but we go ahead and pre-create the more common ones here. And then we have utility functions, stored procedures, that are useful for manipulating MARC and strings and things you might find in legacy data.
And you only call that once. Mig quick: mig quick is a wrapper around mig add, and there's another mig tool called iconv, which we don't actually use that much; that's kind of legacy. Mig clean (we actually probably should have created a slide for it): mig clean is a wrapper around the clean_csv tool in the migration-tools repository.
And clean_csv will actually parse the CSV file, and if it finds errors, if it can't parse something, it actually brings up an exception for you to handle there on the fly. Then, once you handle it, it remembers how you handled it; and furthermore, it can actually apply that fix, based on matching patterns, to the other rows that remain in that file. So it's a very useful tool. But yeah, mig quick.
Rogan: Yeah, and one of the things to say about the CSV cleaning that is extremely useful, even if you don't learn some of the advanced functionality: by giving it a file of headers, it will know how many columns there should be, and it will check each row for that number of columns and whether it can parse the rows correctly. So, extremely useful.
Mig link, as I mentioned earlier, makes it super easy to say that an incoming file should be associated with a table in the database; this is what does that. So, let's say you have an m_items table.
You have an m_items table because you're going to be putting items in the database, simple enough, right? And then you have an items.tsv from your legacy system that you're bringing over from whatever vendor, it doesn't matter, and you want to bring that into a whole bunch of l_ columns. And let's say this is a very robust system and it's got 60 columns; that's really tedious to do all by hand. But with mig link you can tell mig: hey,
this file is supposed to be associated with this table. So what mig is going to do is go through that list of headers, whether they're in a separate file or on the first row; it is going to remove spaces, put an l_ in front of them, and use those as definitions to add onto the m_items table when it creates the SQL file for staging.
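As a hypothetical fragment of what a generated m_items.stage.sql might contain under those conventions (the real output comes from mig convert and will differ in detail):

    -- One l_ column per header in items.tsv, spaces stripped, then a bulk load.
    ALTER TABLE m_items
        ADD COLUMN l_barcode       TEXT,
        ADD COLUMN l_item_type     TEXT,
        ADD COLUMN l_home_location TEXT;

    LOAD DATA LOCAL INFILE 'items.tsv'
      INTO TABLE m_items
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      IGNORE 1 LINES
      (l_barcode, l_item_type, l_home_location);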
Jason: Yeah, everything he said happens, but most of the work is actually done by mig convert; mig link itself is just very simple. It just records a reference, an association, in the tracking table.
A
Yeah-
and
we
probably
should
have
made
mig-
convert
a
separate
slide
here,
but
because
you
can
use
mig
convert
directly
and
by
doing
that,
it'll
just
get
loaded
into
its
own
table
that
you
specify
say
items
underscore
tsv
underscore
from
someone.
I
don't
I'm
not
very
good
at
making.
You
know
clever
table
names,
and,
but
you
do
see
down
here
from
the
output
of
mig
convert
the
writing
m
underscore
items.stage.sql.
This is about analyzing data, but these tools are about data work in general and everything you need for a project. And this is an example of taking a file, this items.tsv, and creating an items.tsv.mapping.xls, because Jason and I both like really robust names that tell you every single thing that went into making a file. It will go through and analyze these columns to give you useful data afterwards, such as: here's the legacy column, and here's how many rows actually have data in them from that file. So in this case there are 19981 rows.
And here's a little bit more information that will be in another part of that XLS file, that Excel spreadsheet that's created: examples of the legacy values that are in there. So here you can see that in this column that's being analyzed, two of them say Book Club, 29 say In Repair, and 8 say To Be Withdrawn. So again, stuff that allows you quick analysis without having to manually write each query.
Next up is kmig reporter. One of the common needs we have as we do things is reports on data, and people not only want reports, they actually want them to be pretty and things like that; so that's where kmig reporter came from. We also want to be able to store reports in something that's friendly to use with git. What kmig reporter does is: you pass it a title; there are some optional arguments you can pass as well. You can give it the name of a data analyst; if not, it'll just say "data analyst".
You can pass it some external files to use as an introductory section and things like that, and then it will go to a stock collection of reports that are in a Koha XML file, run through those, and give you output. And that output is going to look a little bit like this: it is an AsciiDoc file. For those not familiar with AsciiDoc, it's kind of like Markdown.
A
It
is
a
formatting
language.
It
happens
to
be
used
in
the
evergreen
community
for
documentation,
so
that
was
the
purpose
behind
choosing
it.
If
that
hadn't
been
the
case-
and
it
wasn't
historical-
I
probably
would
have
done
it
in
markdown
to
be
honest,
but
it
gives
you
an
s
doc
file
for
your
reports.
That's
very
easy
to
throw
into
get
there's
also
plenty
of
tools
out.
There
that'll
convert
this
into
nice,
html
or
pdf
for
people.
A
There
are
also
the
capability
for
this
to
support
non-stock
xml
report
files.
So
if
you
have
a
custom
data
reporting
need,
you
can
write
a
new
set
of
reports
and
put
them
in
an
xml
file
and
when
you
run
kmig
reporter
just
point
it
at
that
and
run
those
custom
reports,
you
don't
have
to
do
the
stock
ones.
You
don't
have
to
change
the
stock
file.
You
can
point
it
at
some
totally
different
collection.
Next
up
is
bib
stats.
A
As
I
said
before,
this
is
just
a
sort
of
oddball
collection
of
information
when
looking
at
a
mark
file,
so
a
very
standard
way
for
me
to
start
a
data
project
is
to
get
a
mark
file
from
somewhere
and
people
to
say
hey.
This
is
supposed
to
have
this
number
of
bibs
in
it,
and
it's
from
this
source
and
here's
what
we
know
about
it
and
my
starting
position
is
to
trust
but
verify
you
know
so
I'll
run
it
through
this
and
have
it
say
how
many
bibs
are
really
in
it.
A
What
does
the
zero
nine
in
the
leader
say?
Does
it
at
least
think
it's
unicode
well,
of
course,
it'll
depend
on
whether
the
source
system
actually
enforced,
that
or
not
more
often
than
not.
The
statement
in
the
zero
nine
is
more
of
a
hopeful
declaration
than
a
definitive
definition.
What
does
the
leader
say?
It
is
I
obviously
that's
not
a
full
breakdown.
You
really
need
this
zero,
zero,
seven
and
eight
for
more
information,
but
you
know
it's
enough
to
give
you
a
quick
idea
of
what's
in
there.
A
You
know
what
are
in
some
of
the
does
it
have
245
zeros
does
have
100
zeros,
because
I'm
checking
to
see
if
there
are
authorities
are
there
856's
that
might
be
indicative
of
overdrive
and
things
like
that,
and
then
a
little
bit
of
quick
and
dirty
holdings
analysis.
This
does
not
definitively
say
what
the
holdings
are,
but
I
look
for
holdings
that
have
certain
formats.
And then, as I mentioned before: kmig export and import. This is an example of me running kmig export, and here you see some of the stuff that it's sending out (I probably should have a more sophisticated term here than "stuff"): authorized values, booksellers, budgets, borrower attributes, calendar, circ rules, item types. I didn't put it all here; there's some more at the bottom. But what it's exporting is not just a dump of those tables.
A
It's
actually
a
little
bit
more
involved
than
that
and
sensitive
to
what
the
data
is.
So,
for
example,
some
of
these
tables
have
a
sequence
that
is
actually
used
and
if
you're
going
to
pull
those
tables
into
another
system,
you
need
to
make
sure,
before
you
truncate
data
out
of
that
previous
system,
that
it's
going
to
obey
the
new
sequence
rules,
so
that
information
is
in
there
also
for
some
of
these
they
have
to
pull
in
data
from
multiple
tables.
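For example, honoring the sequence rules on the target side amounts to something like this sketch (kmig export and import carry this bookkeeping in their data sets; authorised_values here stands in for any table with an auto-increment id):

    -- Find the highest imported id, then move the counter past it so new
    -- rows created on the target system don't collide.
    SELECT MAX(id) FROM authorised_values;
    ALTER TABLE authorised_values AUTO_INCREMENT = 1001;  -- i.e. MAX(id) + 1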
A
So
these
are
artisanally.
Crafted
data
sets,
if
you
will,
and
then
we
have
kmage
import
and
k-meg
import
is
simply
taking
everything
that
k-mag
export
does
and
following
those
same
rules
to
bring
it
in.
So
if
it
has
to
reset
a
sequence,
it
does
that
if
a
table
needs
to
be
loaded
before
another
table
as
a
prerequisite
for
that
data
set,
it
does
that
all
those
sorts
of
things.
A
So
it
is
sensitive
to
schema
changes,
and
all
of
that
brings
us
to.
Why
do
we
use
it?
Big
picture
wise.
I
mean
it's
nice
to
be
lazy.
That's
a
virtue
of
programming.
According
to
larry
wall.
You
want
to
not
take
unnecessary
effort,
but
there
are
actually
a
lot
of
advantages
beyond
automation
to
why
we
use
the
kmig
toolset,
and
some
of
those
benefits
include
easier
iterative
testing
and
less
churn.
Jason: Yeah, there's a bit of the Unix philosophy here, where we want small tools that each do one thing; we just happened to put them under that big umbrella, yeah. And the other thing is, we didn't want a program that tried to do everything with just the press of a button. So all these tools produce artifacts, and all those artifacts are text: they're easy to put into git repositories, or to diff and manipulate, and you can, you know, interject yourself between these milestones, workflow milestones.
Rogan: Do you want to talk a little bit about, you know, the ETL and all that kind of stuff?
Jason: Yeah. So we call these data migrations, but other industries use the word ETL a lot. ETL has some different connotations to it; it stands for extract, transform, and load, and often that's more for things that recur a lot, like entire pipelines where you're constantly moving from one data source to another. Migrations are like that, but you'll find that even with the same version of a legacy
B
Software
system
that
libraries
will
use
that
system
differently
and
they
will
open
those
fields
and
try
to
work
around
limitations
in
the
system
so
that
we
do
get
code
reused.
There's
this
almost
never
goes
without
some
editing
needed
some
tweaks
and
yeah.
We
have
tools
for
extracting
data
from
these
systems
tools
for
munjukit
and
these
are
useful
and
for
quahog
and
evergreen
context
and
the
ones
we
like.
You
know
some
of
the
other
ones
we
kind
of
wrap
and
make.
Rogan: So our sort of compromise is doing the same thing that we do with tables: we put an m_ in front, so that, while it's not terribly likely, if somebody else creates an upsert datafield function, we're not conflicting with names. We create these and load these through kmig init, like a lot of other stuff. And these are a combination of utility and just quality-of-life things. On the utility side you have things like update leader, update 003, and upsert datafield.
A
These
are
manipulating
bits
of
the
mark
and
are
just
handy
things
to
have
around
for
manipulation.
Sure
you
could
do
these
in
other
ways,
but
it's
just
nice.
Then
you
get
to
things
like
m
split,
string,
m
string
segment
count.
These
are
just
wrappers
around
more
complex
callings
of
things
like
substring
and
substring
position.
You
certainly
don't
have
to
use
these
they're,
not
reinventing
the
world,
but
they're
awfully
convenient
and
make
your
code
way
more
readable
when
you're
doing
a
ton
of
string
manipulations
on
legacy
data.
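To illustrate the readability point (the wrapper's exact signature here is assumed, not quoted from kmig):

    -- Without a helper: pull the third comma-separated segment by hand.
    SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(l_name, ',', 3), ',', -1))
      FROM m_borrowers;

    -- With an m_split_string(value, delimiter, position) style wrapper,
    -- the same intent reads at a glance.
    SELECT m_split_string(l_name, ',', 3) FROM m_borrowers;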
Why have a big, convoluted, nested string of substring and substring-position calls in order to pull out the middle name from a text field when you can just split it easily with a function? After that, I wanted to say: kmig is not a static thing. We're constantly using it, and I'm constantly looking for more things to add to the reporter and bibstats, for example. Some things are fairly static and have been around a long time, but even those receive occasional tweaks and changes.
So this is a tool set that's constantly in use and constantly receiving updates, and I welcome people to contribute to it. We make it available to the larger community because we want our work to be useful to other people, but we're also perfectly willing to take advantage of your labor.
Jason: So there are other tool sets out there. We started this with Evergreen, so this was a natural evolution of the Evergreen tool chain for doing migrations. We're not above cribbing from these other tools when needed, especially for extracting data from legacy systems, but a lot of times they are following a different philosophy and they aren't using staging tables like we are. So they're there, and we're not trying to denigrate them or suffer from not-invented-here syndrome, but...
Rogan: So at this point I want to make sure that everybody has a chance to get hold of us if they want. If you have any questions, if you have feedback, thoughts, anything you want: here are our email addresses, jason at equinoxinitiative.org, and myself, rhamby at equinoxinitiative.org. We'll be glad to chat with you. Do you have any parting words, Jason, before we sign off? Yeah? All right, bye bye.