Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
The talks have been amazing and there's some crossover, which is really cool, especially with the data trust stuff that Kelsey spoke about. I'm Alex, the co-founder of Phase 3, a Web3 innovation consultancy. For the first half of the year I worked on a project called Links, where we looked at enabling privacy-preserving storage, sharing, and analysis of sensitive data, with a focus on biometric data from wearable devices, to amplify user insights and catalyze science and innovation.
So I want to start off with a story of something that recently happened to a close friend. He was a freshly graduated Master of Engineering; the ink was barely dry on his degree when he began pulling together all the work for his portfolio, and he finds out two months later that the university has deleted it all off their servers and locked him out of his account. Everything. His six years of work, which he trusted the institution to store, is vital both for building his portfolio and for actually patenting his final-year project.
This has also been deleted: too expensive for them to store for even another day post-graduation. Now he's spending days emailing past tutors and coursemates, trying to find things that have been squirreled away on email servers, download folders, and personal hard drives. This story is not unique. He, along with thousands of other students and researchers, trusted the storage provided by universities to keep their knowledge safe.
So let's first dive into what is meant by privacy-preserving compute, also referred to as privacy-preserving machine learning or AI. Privacy-preserving computation requires the derivation of insights from data without ever having to see or share that data. This is enabled by cryptography, distributed computation, and verifiable privacy and governance.
For this example, we only need to know two things: does Wally exist, and what are his exact coordinates? This is a crude example of a data-to-compute model, where we, as the data analyst, look for Wally by looking at all of the data in order to determine these two points. With privacy-preserving compute, we instead use methods that obfuscate the data that is not necessary to determine the answers we need.
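One way to picture the difference: rather than handing the analyst the whole scene, expose only the two permitted answers. The scene and function below are purely hypothetical.

```python
# A toy "answer-only" interface: the analyst learns whether Wally exists
# and where, but never receives the raw scene itself.
from typing import Optional, Tuple

SCENE = {("wally", (14, 3)), ("wizard", (2, 9)), ("odlaw", (7, 7))}

def find_wally(scene) -> Tuple[bool, Optional[Tuple[int, int]]]:
    """Answer the two permitted questions; everything else stays hidden."""
    for name, coords in scene:
        if name == "wally":
            return True, coords
    return False, None

# The analyst calls the query interface instead of receiving SCENE.
exists, coords = find_wally(SCENE)
print(exists, coords)  # True (14, 3)
```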
Privacy-preserving compute is a vital building block for DeSci. For context, half of all FDA-approved medical devices based on AI are for radiology, and these are trained on no more than a thousand images. For comparison, DALL-E is trained on hundreds of millions of images, and most doctors see between seven and eight thousand patients a year.
This causes a problem, as these AI models are only being trained on a narrow subset of the population, leading to a misrepresentation of those who are excluded from the dataset. Privacy-preserving computation is already being done without a blockchain, with some of the most prominent work being pioneered by organizations like OpenMined.
A
So
with
distributed
computation,
we
have
the
ability
to
deploy
machine
learning
models
to
get
meaningful
Knowledge
from
geographically
distributed
large-scale
data.
We
see
that
with
baklav
several
well,
several
architectures
exist
to
be
able
to
do
this.
Privacy
and
security
have
not
been
sufficiently
addressed
and
existing
models
are
vulnerable
in
their
architecture
and
have
efficiency
limitations.
So I want to talk about what privacy is, because there are a few definitions of this. I want to look at it in the context of Helen Nissenbaum's contextual integrity theory of privacy, which defines privacy as appropriate flows of information, where appropriateness is defined by the context and its contextual informational norms.
This definition of privacy is much more nuanced and captures all kinds of edge cases. For example, is your genomics data strictly yours? When you get your genes sequenced, that data also belongs to your ancestors and to any children you have or may have. This is just one example of why the issue of privacy-preserving computation is not just a technical question, but also an ethical one.
A
Federated
learning
is
an
interesting
model,
as
this
allows
for
models
to
be
trained
collaboratively
by
distributed
nodes.
In
theory,
this
model
is
privacy.
Preserving.
However,
there
is
evidence
that
Federated
learning
language
models
can
be
reverse
engineered
to
reveal
private
information.
Therefore,
more
research
needs
to
be
done
to
truly
establish
privacy,
preserving
computational
methods.
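For readers unfamiliar with the setup, here is a toy federated averaging (FedAvg) loop in NumPy: each node trains on its own private data and shares only model weights with a coordinator. It is an illustrative sketch, not the production protocol, and, per the caveat above, sharing weights alone does not guarantee privacy.

```python
# Minimal federated averaging: three nodes fit a shared linear model
# without any node ever sending its raw data to the coordinator.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

# Each node holds private data that never leaves the node.
datasets = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(0, 0.1, size=50)
    datasets.append((X, y))

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One node's local training: plain gradient descent on MSE."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

global_w = np.zeros(3)
for _ in range(10):  # federated rounds
    local_ws = [local_update(global_w, X, y) for X, y in datasets]
    global_w = np.mean(local_ws, axis=0)  # coordinator averages weights only

print(global_w)  # approaches [1.0, -2.0, 0.5] without pooling any raw data
```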
So through our research with Links, we identified four ways to theoretically manage this data in a decentralized way. The first is using Web2 tools. There are various architectures to approach the problem this way; however, the system cannot be truly decentralized, as there will always have to be a trusted intermediary between the different parties to maintain the system.
Second, we looked at a completely open, decentralized approach. This involved encrypting sensitive data with cryptographic solutions such as homomorphic encryption and storing that data on the public IPFS network. The risks of this include that, when the encryption standard is broken, the sensitive data is no longer private.
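A minimal sketch of that storage pattern, assuming a local IPFS daemon on the default API port; for brevity it uses ordinary symmetric encryption (Fernet) as a stand-in for the homomorphic schemes mentioned above, and the payload is hypothetical (`pip install cryptography ipfshttpclient`).

```python
# Encrypt locally, then store only ciphertext on the public IPFS network.
from cryptography.fernet import Fernet
import ipfshttpclient

key = Fernet.generate_key()  # stays with the data owner
fernet = Fernet(key)

sensitive = b'{"heart_rate": [72, 68, 75]}'  # hypothetical wearable data
ciphertext = fernet.encrypt(sensitive)

with ipfshttpclient.connect() as client:  # /ip4/127.0.0.1/tcp/5001 by default
    cid = client.add_bytes(ciphertext)    # only ciphertext touches the network
    print("stored at CID:", cid)

    # Anyone can fetch the bytes by CID, but only the key holder can read them.
    fetched = client.cat(cid)
    assert fernet.decrypt(fetched) == sensitive
```

Note the risk described above: if the encryption scheme used here is ever broken, the ciphertext is already public and cannot be recalled.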
There is an argument that you could locate the data by its CID, the content identifier, and therefore re-encrypt the data with a new algorithm. However, there's a risk: if the node storing the data encrypted with the old mechanism goes offline, then you'll never be able to re-encrypt that data with a hundred percent certainty. So, therefore, this is not a viable option, because there's no guarantee of the right to be forgotten.
So third, we looked at a hybrid approach. This involved storing sensitive data in centralized databases, with pointers to these databases stored on the blockchain. This could be governed by intermediaries such as trusted data unions or data trusts. Although this method still relies on some centralization, with the risk of a data union becoming so big that it controls the entire network, there could be rules put in place to prevent this from happening.
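The pointer pattern can be sketched in a few lines. Here the on-chain side is mocked as a plain list and every name is hypothetical; the point is that only a location and a content hash go on-chain, so anyone can detect tampering with the off-chain copy.

```python
# Hybrid pattern: sensitive record off-chain, tamper-evident pointer on-chain.
import hashlib
import json

off_chain_db = {}   # stands in for the data union's centralized database
on_chain_log = []   # stands in for a smart contract's storage

def store(record_id: str, record: dict) -> None:
    payload = json.dumps(record, sort_keys=True).encode()
    off_chain_db[record_id] = payload
    on_chain_log.append({
        "pointer": f"db://records/{record_id}",         # where the data lives
        "digest": hashlib.sha256(payload).hexdigest(),  # tamper-evidence
    })

def fetch(entry: dict) -> dict:
    record_id = entry["pointer"].rsplit("/", 1)[-1]
    payload = off_chain_db[record_id]
    # The on-chain digest lets anyone verify the off-chain copy is intact.
    assert hashlib.sha256(payload).hexdigest() == entry["digest"]
    return json.loads(payload)

store("patient-42", {"scan": "radiology-xyz", "consent": "research-only"})
print(fetch(on_chain_log[0]))
```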
Last, we toyed with the idea of a private network approach. We used IPFS's private networks as inspiration for this. What this could look like for DeSci is multiple institutions, such as universities and research centers, deciding to create a private distributed network between them to enable the sharing of sensitive data without the risk of the nodes going offline, as keeping nodes online would be part of the legal agreement between them.
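As a rough sketch of how such a network is bootstrapped with go-ipfs/Kubo: every participating node shares the same pre-shared key in a swarm.key file inside its IPFS repo, and only peers holding that key can connect; setting the LIBP2P_FORCE_PNET=1 environment variable makes the daemon refuse to start without one. The generation script below is a minimal illustration, assuming an initialized repo at ~/.ipfs.

```python
# Generate a swarm.key for an IPFS private network and place it in the
# node's repo. Each institution's node must hold an identical copy.
import os
import secrets

def write_swarm_key(path: str) -> None:
    # PSK v1 file format understood by go-ipfs/Kubo: two header lines
    # followed by 32 random bytes, hex-encoded.
    key_hex = secrets.token_bytes(32).hex()
    with open(path, "w") as f:
        f.write("/key/swarm/psk/1.0.0/\n/base16/\n" + key_hex + "\n")

write_swarm_key(os.path.expanduser("~/.ipfs/swarm.key"))
```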
So what next? From our research, the biggest issues lie in the gray area of data ethics. This is the balancing act between doing and being paralyzed by the what-ifs. Successful teams will be interdisciplinary, and research still needs to be conducted to develop appropriate ethical standards for how models can be trained and how data is handled. This could look like global red lines; however, with the world looking how it does now, this may not be the most appropriate approach.
This is an area where data unions could add tremendous value. Rather than complying with the data privacy rules of your geographical location, you could join different data unions which fit closely with your values. For example, maybe you want to share your data with all research projects, so you would join a data union that does that. Or maybe you only want to share your data with research projects focused on X, but you don't want to do it for anything else; you could join a data union that does that.
So, to conclude, this is an incredibly nascent area, and if you want to discuss this further, please get in touch. We have an ethics Telegram channel focused on human data in Web3, which is open for anyone to join. You can message me on Telegram, which is the same handle as at the top, and I can add you. And there are lots of projects working on this problem which I can direct you towards. And yeah.