Introduction
Artificial Intelligence (AI) is a wide-ranging, powerful tool with the potential to radically transform our lives. It is therefore important to regulate it in a way that maximizes its benefits for society.
One noteworthy regulation effort is the European Commission’s ethics guidelines for trustworthy AI, which outline a set of key requirements that AI systems should meet to be deemed trustworthy. At Traent, we believe that blockchain technology is an ideal candidate to fulfill these requirements, ultimately leading AI toward more trustworthy ends.
In this post, we will show how an entire Machine Learning (ML) pipeline could be operated on Traent Blockchain. This technology has some specific properties such as transparency, auditability, and accountability.
To provide a concrete example and enhance comprehension, we will construct a toy model for a potential use case in the healthcare domain. This will be created by leveraging Traent Era, the first collaboration platform based on blockchain, and Era Workflows, a powerful smart contract engine capable of executing data-intensive tasks like ML training.
We will simulate a scenario involving the secure sharing and tracking of patient medical records, the training of the AI model, and its execution on unseen data using the Traent architecture. This setup protects the integrity and confidentiality of the training data while increasing the transparency and accountability of any operation on it.
Traent Era
Era is a collaboration platform built on Traent hybrid blockchain technology. Traent combines the best features of public and private blockchain, effectively addressing enterprises’ key challenges when implementing blockchain technology, such as privacy, cost, and time-to-market. Furthermore, Traent has developed user-friendly tools and interfaces specifically designed for users who may not be well-versed in these technologies. This approach tackles the issue of poor user experience that often plagues traditional blockchain solutions.
Era allows creating, managing, and sharing Projects with colleagues, business partners, and external parties with granular permissions.
A Workflow regulates each Project and can contain the following resources:
- Data streams
- Documents of any type
- Tags (useful for labeling)
- Threads
A blockchain is created for each project, to which all users involved in the collaboration have access. This approach provides many advantages:
- data uploaded to the blockchain is kept private;
- blockchain’s blocks are not widespread but are only accessible to organizations involved in the Project;
- a blockchain (or a portion of it, with selected data hidden) can be exported (copied) and audited externally;
- data uploaded to the blockchain is tamper-evident. Some data alterations are allowed, particularly selective data redaction, but such alterations are evident to auditors.
Era Workflows
A Workflow is a smart contract that regulates a Project.
The Workflows are composed of:
- states showing specific steps in the process of managing Project resources, with operations on such resources limited by constraints;
- transitions between such states, regulated by conditions that define the requirements for the transition from one state to another;
- actions that may be performed before or after transitions.
Projects and Workflows can be seen as Business Process Management (BPM) tools, but actually, they are much more.
Traent’s Workflow Engine leverages Google V8, a high-performance JavaScript and WebAssembly engine. Our Workflow tooling includes a standard library and ready-to-use Workflow templates. These allow using TypeScript and facilitate developing Workflows using any language by targeting WebAssembly.
Because Workflows can use WebAssembly modules, we can embed the TensorFlow.js library and develop ML models natively inside the Workflow environment.
Anyone can also create workflows and their constraints through an easy-to-use visual editor. This allows non-programmers to effortlessly design Workflows by creating and positioning states and the transitions between them by drag and drop.
A Healthcare Use Case: Using Machine Learning to Detect Early-stage Cancers
We want to train a machine learning (ML) model to detect cancer from a patient’s radiography (X-ray) scan. The model should label an X-ray image as “suspicious” or “unsuspicious” based on its training. To make the process trustworthy, we take advantage of blockchain technology. To this end, we create a Project with the following Workflow:
There are three states in the Workflow: Loading, Training, and Inference. In the Loading state, a user loads an ML model to be trained, along with training data. The Training state allows the model to be trained on the data until the result is sufficiently accurate. After that, the process moves to the Inference state.
In the Inference state, the model receives an unseen X-ray image and outputs whether the scan reveals suspicion of cancer or not.
Arrows represent workflow transitions. Unlabeled arrows represent user-triggered transitions that advance the process to the next state. Labeled arrows represent transitions triggered automatically by the Workflow when something happens (e.g., when an X-ray scan Document is loaded into a Project, a classified transition is triggered).
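The Workflow described above can be pictured as a small state machine. The sketch below is plain Python, not the Era Workflows API: the `Project` fields, the `advance` helper, and the 0.9 accuracy threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Project:
    """Hypothetical stand-in for the resources held by a Project."""
    state: str = "Loading"
    model_loaded: bool = False
    dataset_loaded: bool = False
    accuracy: float = 0.0
    history: List[str] = field(default_factory=list)


# (source state, target state) -> condition that must hold for the transition.
TRANSITIONS: Dict[Tuple[str, str], Callable[[Project], bool]] = {
    ("Loading", "Training"): lambda p: p.model_loaded and p.dataset_loaded,
    ("Training", "Inference"): lambda p: p.accuracy >= 0.9,  # assumed threshold
}


def advance(project: Project, target: str) -> bool:
    """Move to `target` only if the transition exists and its condition holds."""
    condition = TRANSITIONS.get((project.state, target))
    if condition is None or not condition(project):
        return False
    project.history.append(f"{project.state} -> {target}")
    project.state = target
    return True


project = Project()
blocked = advance(project, "Training")  # refused: nothing has been loaded yet
project.model_loaded = project.dataset_loaded = True
moved = advance(project, "Training")
project.accuracy = 0.93
finished = advance(project, "Inference")
print(project.state, project.history)
```

Keeping the conditions in a single table makes it easy to see which requirements guard each transition, which is the same information the Workflow diagram conveys.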
Problems that may arise
Cancer detection from X-ray images is a sensitive task, and using AI can significantly improve the field. However, doctors and patients need strong guarantees before trusting such a tool.
To list just a few trust problems that emerge in this situation:
- due to its sensitivity, data should be kept as private as possible;
- each training image should have its own access policy;
- a doctor should know that if they correctly use the AI tool, they won’t be liable for complaints;
- an audit trail of all the operations performed during the model training should exist. This is useful to prove that it went as expected.
In the following sections, we show in greater detail how a Workflow comes in handy to address these problems.
Certified Training Set
The first step of the process we want to model consists of loading data into the Project. In particular, we load a Document containing the parameters of the ML model to be trained. We also upload a Document that represents the training dataset (a set of X-ray images and, for each image, a label “suspicious” or “unsuspicious”).
We modify the previous Workflow diagram and split the transition loadData into two separate transitions. Additionally, we add a new state called Data Editing, where images can be removed from the dataset. Here is the updated diagram:
The magic begins when a participant uploads data into a Project. Under the hood, data is written into the blockchain and thus becomes certified! Once something is written into the blockchain, it becomes tamper-evident.
This is a crucial feature for trustworthiness. No need to trust the data or parameters used by those who train the ML model: this information is already certified on the blockchain!
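To give an intuition of what tamper-evident means, here is a minimal hash-chain sketch in Python. It is a generic illustration and not the actual Traent implementation: the block layout, the `append` and `verify` helpers, and the sample payloads are assumptions.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first block


def block_hash(prev_hash: str, payload: Dict) -> str:
    """Hash the payload together with the hash of the previous block."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()


def append(chain: List[Dict], payload: Dict) -> None:
    """Add a block that commits to both its payload and its predecessor."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": block_hash(prev, payload)})


def verify(chain: List[Dict]) -> bool:
    """Recompute every hash; any edited payload breaks the chain."""
    prev = GENESIS
    for block in chain:
        if block["prev"] != prev or block["hash"] != block_hash(prev, block["payload"]):
            return False
        prev = block["hash"]
    return True


chain: List[Dict] = []
append(chain, {"document": "model-parameters", "learning_rate": 0.01})
append(chain, {"document": "training-set", "images": 1200})
intact = verify(chain)

# Silently changing a training parameter is detected on the next verification.
chain[0]["payload"]["learning_rate"] = 0.5
tampering_detected = not verify(chain)
print(intact, tampering_detected)
```

Because each block commits to the hash of the previous one, an auditor only needs to recompute the hashes to notice that a parameter or dataset was altered after upload.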
Until now, we talked about storing data into the blockchain to keep things simple. For each Project, data is stored in a Traent Ledger. This is, roughly speaking, a chain of blocks along with an off-chain storage. From now on, we will be more precise and refer to the layer underlying a Project as Ledger instead of blockchain.
Adding Metadata to Improve Access Control
We have built a Workflow that executes an ML pipeline directly on a Ledger from data loading to inference! That’s good for sure, but we can aspire to even more. In a trustworthy setting, training data should be as private as possible. Think of the GDPR regulation (in particular, article 5.1-2): among the other data protection principles, there is a request for:
- purpose limitation: data must be processed for the legitimate purposes specified explicitly to the data subject when data was collected;
- data minimization: only as much data as necessary for the specified purposes should be collected and processed;
- storage limitation: personally identifying data should be stored only for as long as necessary for the specified purpose.
To enforce these principles, the Workflow should choose training data from the input dataset according to specific permissions for each image. We want a metadata Document containing these permissions to be loaded on the Era platform.
For each image of the dataset, the Document specifies:
- a list of the authorized purposes for the image;
- when the image was first created;
- if the image has a limited timespan of legitimate usage.
We update the Workflow diagram by adding the transitions loadPermissions and updatePermissions. These are for uploading and updating the metadata Document for the dataset. We also add a new state called Filter. In this state, the Workflow filters the dataset by excluding images that don’t give explicit permission for that training or whose usage permission has expired.
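The filtering step can be sketched as follows. This is plain Python written under our own assumptions: the `ImagePermission` fields mirror the three metadata items listed above, while the field names, the purpose strings, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ImagePermission:
    image_id: str
    purposes: FrozenSet[str]   # authorized purposes for the image
    created: date              # when the image was first created
    valid_days: Optional[int]  # None means no time limit on usage


def is_usable(p: ImagePermission, purpose: str, today: date) -> bool:
    """An image is usable if the purpose is authorized and it has not expired."""
    if purpose not in p.purposes:
        return False
    if p.valid_days is None:
        return True
    return today <= p.created + timedelta(days=p.valid_days)


def filter_dataset(perms: List[ImagePermission], purpose: str, today: date) -> List[str]:
    """Return the ids of the images that may be used for `purpose` today."""
    return [p.image_id for p in perms if is_usable(p, purpose, today)]


TRAINING = "cancer-detection-training"
permissions = [
    ImagePermission("xray-a", frozenset({TRAINING}), date(2023, 6, 1), None),
    ImagePermission("xray-b", frozenset({"billing"}), date(2023, 6, 1), None),
    ImagePermission("xray-c", frozenset({TRAINING}), date(2022, 1, 1), 365),
    ImagePermission("xray-d", frozenset({TRAINING}), date(2023, 12, 1), 90),
]
usable = filter_dataset(permissions, TRAINING, today=date(2024, 1, 1))
print(usable)
```

Here `xray-b` is excluded because training is not among its authorized purposes, and `xray-c` because its usage period ended before the training date.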
Usage note
Images are not only filtered out by the Workflow according to their permission policy: participants with the appropriate rights over the data can also remove them manually (or automatically, using external tools). Combining these features makes it possible for a participant to delete an image from the Project (at a lower level, remove an image from the Ledger) while keeping on the Ledger the reasons why the image was removed. For example, a participant could remove from the Ledger an image that has exceeded its legitimate usage timespan and leave on the Ledger the portion of the Document that testifies that the image expired. The participant thus leaves on the Ledger a proof that the image was removed according to its access policy and not arbitrarily.
If required, the Workflow could be instructed to delete the models trained on a dataset that included the deleted image.
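One way to picture removal with a retained justification is to keep payloads in a separate store and record only their digests in an append-only log. The Python sketch below is an illustration under that assumption; the `ledger` and `storage` structures and the `upload`, `redact`, and `redaction_reason` helpers are hypothetical and do not describe how the Traent Ledger is implemented.

```python
import hashlib
from typing import Dict, List, Optional

ledger: List[Dict] = []         # append-only records (hypothetical layout)
storage: Dict[str, bytes] = {}  # payloads kept outside the chain, keyed by digest


def upload(name: str, data: bytes) -> str:
    """Store the payload and record its digest in the log."""
    digest = hashlib.sha256(data).hexdigest()
    storage[digest] = data
    ledger.append({"op": "upload", "name": name, "digest": digest})
    return digest


def redact(digest: str, reason: str) -> None:
    """Delete the payload but leave a record explaining why it was removed."""
    del storage[digest]
    ledger.append({"op": "redact", "digest": digest, "reason": reason})


def redaction_reason(digest: str) -> Optional[str]:
    """Let an auditor look up why a payload is no longer available."""
    for record in ledger:
        if record["op"] == "redact" and record["digest"] == digest:
            return record["reason"]
    return None


image_digest = upload("xray-17.png", b"raw image bytes")
redact(image_digest, "usage period expired")
print(image_digest in storage, redaction_reason(image_digest))
```

The image bytes are gone, yet the log still shows that the image existed, that it was removed, and the stated reason for the removal.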
Collaborative Annotation of Data
Rather than loading a pre-labeled dataset into Era, it’s possible to load unlabeled images and label them directly on the platform. Using the Project’s native Tag system, users can collaboratively assign Tags to any kind of Document, including single images, and can thus label a dataset on Era. This seemingly simple feature is very powerful: it enables the platform to produce a tamper-resistant audit log of each action performed on the data, including who performed it.
For example, a group of doctors could label portions of data in parallel, reducing the overall labeling time and taking responsibility for the evaluations performed over the training X-ray images. We refer to this strategy of tracing each labeling action to the user that performed it as accountability of annotation.
We can even imagine a situation with two layers of Tags: one to identify training images as “suspicious” or “unsuspicious,” and another to mark manually labeled images as “reviewed” by a Project participant designated as the labeling reviewer. This way, we not only achieve accountability of both annotation and review, but can also let the Workflow automatically filter out images that have not yet been reviewed.
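The two-layer tagging idea can be sketched in a few lines of Python. The `TagEvent` record, the participant names, and the `reviewed_labels` helper are assumptions for illustration and are not part of the Era Tag system.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

LABELS = {"suspicious", "unsuspicious"}


@dataclass(frozen=True)
class TagEvent:
    image_id: str
    tag: str
    author: str  # who assigned the Tag (accountability of annotation)


def reviewed_labels(events: List[TagEvent], reviewer: str) -> Dict[str, str]:
    """Return image_id -> label, keeping only images the reviewer approved."""
    labels: Dict[str, str] = {}
    reviewed: Set[str] = set()
    for event in events:
        if event.tag in LABELS:
            labels[event.image_id] = event.tag  # the latest label wins
        elif event.tag == "reviewed" and event.author == reviewer:
            reviewed.add(event.image_id)
    return {image: label for image, label in labels.items() if image in reviewed}


events = [
    TagEvent("xray-1", "suspicious", "dr_rossi"),
    TagEvent("xray-2", "unsuspicious", "dr_bianchi"),
    TagEvent("xray-1", "reviewed", "dr_verdi"),
    TagEvent("xray-2", "reviewed", "dr_rossi"),  # ignored: not the designated reviewer
]
training_labels = reviewed_labels(events, reviewer="dr_verdi")
print(training_labels)
```

Since every event carries its author, the same list serves both as the source of training labels and as the record of who labeled and who reviewed each image.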
Verifiable Transformations of the Dataset
In machine learning, data isn’t typically fed directly into a model. Instead, preliminary operations are performed on the dataset, such as normalizing images to improve the model’s performance and stability during training.
These transformations can be carried out by the Workflow and executed directly on a Ledger. The log of these operations is also stored on the Ledger, providing verifiability.
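As an illustration, the sketch below normalizes 8-bit pixel values and records a digest of the data before and after the operation. It is plain Python under our own assumptions: the log format, the `digest` helper, and the operation name are hypothetical, and a real pipeline would operate on full images rather than a short list.

```python
import hashlib
import json
from typing import Dict, List, Sequence


def digest(values: Sequence[float]) -> str:
    """Fingerprint a sequence of pixel values."""
    return hashlib.sha256(json.dumps(list(values)).encode("utf-8")).hexdigest()


def normalize(pixels: Sequence[int], log: List[Dict]) -> List[float]:
    """Scale 8-bit pixel values to [0, 1] and log input and output digests."""
    scaled = [p / 255 for p in pixels]
    log.append({
        "operation": "scale_to_unit_range",
        "input": digest(pixels),
        "output": digest(scaled),
    })
    return scaled


transform_log: List[Dict] = []
scaled = normalize([0, 51, 255], transform_log)
print(scaled, transform_log[0]["operation"])
```

Anyone holding the original pixels can rerun the transformation and compare digests with the logged ones, which is what makes the preprocessing step verifiable.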
We can update the previous Workflow diagram to include dataset transformations. Here is the updated diagram:
Audit of the Data
The use of the blockchain makes an ML pipeline easily auditable. Each step of the Workflow is referenced on-chain: the loading of the dataset and the ML model, the manual tagging procedure, the data preprocessing, the permission evaluation, the training of the ML model (for each training epoch, the Workflow can write a Document containing the updated weights), the inference, and the updates to the dataset.
Furthermore, the combination of auditability with the data integrity and accountability properties ensured by the Ledger results in a complete data lineage for every piece of data involved in the ML process.
Thanks to the Granular Data Disclosure feature and the hybrid nature of the solution, the data is auditable by both private blockchain network users and external auditors.
How to do it
When the data needs to be audited by a third party, an export of a subset of the data contained in the Ledger is created using Era. It’s possible to select which resources to exclude when creating a Ledger export (selective disclosure of the data); thus, Era allows the users to choose the level of privacy desired for one’s export in a granular way.
You can store the result on-prem or in the cloud, share or publish it online, and finally, view it directly in the browser using the Traent Viewer component (you can find its source code here).
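Selective disclosure can be pictured as copying the entries while replacing the content of hidden resources with a digest. The Python sketch below is only an illustration: the entry layout and the `export` and `content_digest` helpers are assumptions and do not reflect the actual Era export format.

```python
import hashlib
import json
from typing import Any, Dict, List, Set


def content_digest(content: Any) -> str:
    """Fingerprint the content of a resource."""
    return hashlib.sha256(json.dumps(content, sort_keys=True).encode("utf-8")).hexdigest()


def export(entries: List[Dict], hidden: Set[str]) -> List[Dict]:
    """Copy the entries, omitting the content of resources listed in `hidden`."""
    exported = []
    for entry in entries:
        item = {"name": entry["name"], "digest": content_digest(entry["content"])}
        if entry["name"] not in hidden:
            item["content"] = entry["content"]
        exported.append(item)
    return exported


entries = [
    {"name": "trained-model", "content": {"weights": [0.12, -0.7, 1.3]}},
    {"name": "training-set", "content": {"images": ["xray-1", "xray-2"]}},
]
public_export = export(entries, hidden={"training-set"})
print([sorted(item) for item in public_export])
```

The recipient sees the trained model in full, while for the training set only a digest is disclosed, so the private data stays hidden but its existence remains on record.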
The flexibility of export-based data auditing turns out to be useful in a lot of situations, such as:
- a hospital can share its (trained) early cancer-detection ML model by exporting the Ledger and excluding the private data of the training dataset;
- a patient whose data was used for training can check whether their data is used as expected;
- an independent entity could audit every step of the ML pipeline and certify conformity to some given standards.
Verifiable Inference Results
Another peculiarity of Workflow-trained ML models is that inference results are more than just obscure outputs of a black-box system. Instead, by exploiting the data lineage property of the system, it is possible to prove whether a given inference output descends from our Workflow-powered model. This property is crucial in several situations.
Consider an oncologist who is aided by a Workflow-trained ML model and has access to a sufficiently expressive Ledger export. When they use the model to classify a radiographic image of a patient, they are assured that the classification result derives exactly from that model and that the model was trained with appropriate data. Being able to demonstrate that a diagnosis was informed by such a tool allows responsibility for each choice of the ML model to be shared among the people involved.
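A simplified way to picture this check is to attach the digest of the model to every inference record and compare it with the digests certified during training. The sketch below is plain Python under our own assumptions: the record layout and the `record_inference` and `descends_from` helpers are hypothetical, and the byte strings stand in for real model weights and images.

```python
import hashlib
from typing import Dict, Set


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def record_inference(model_weights: bytes, image: bytes, label: str) -> Dict[str, str]:
    """Bind an inference output to the exact model and input that produced it."""
    return {"model": sha(model_weights), "input": sha(image), "label": label}


def descends_from(record: Dict[str, str], certified_models: Set[str]) -> bool:
    """Check that the record refers to a model whose digest was certified."""
    return record["model"] in certified_models


certified_weights = b"weights written at the end of training"
certified_models = {sha(certified_weights)}

genuine = record_inference(certified_weights, b"patient x-ray", "suspicious")
foreign = record_inference(b"weights from elsewhere", b"patient x-ray", "unsuspicious")
print(descends_from(genuine, certified_models), descends_from(foreign, certified_models))
```

A record produced by any other model fails the check, which is the property a doctor would rely on when showing that a classification came from the certified tool.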
Conclusions
In this post, we have seen how deploying an ML pipeline on Era and Workflows exploits the useful properties of blockchain to increase AI trustworthiness.
At our company, we believe in harnessing the potential of AI while prioritizing trustworthiness and transparency.
We invite you to join us on this exciting journey toward building a more trustworthy and understandable AI ecosystem. By partnering with us, you can play an active role in shaping the future of AI and contributing to the development of responsible and ethical AI applications. If you’d like to get started or learn more about how to interact with us, please visit our website or contact our team. We look forward to collaborating with you and making a positive impact together.
Authors
Fabio Severino, Andrea Pelosi, Claudio Felicioli, Andrea Canciani