1 Introduction

Machine learning (ML), a concept that incorporates various characteristics of intelligent systems, namely the pursuit of particular goals, a formal representation of knowledge and automated logical inference [1], is expected to represent a major leap from data analysis to high-quality and efficient predictions, and to increase the value of informed judgements [2].

At the same time, ML has to address general issues such as fairness, ethics, robustness and explainability of trained models (cf. e.g. [3, 4]) in order to reduce legal, economic and political uncertainty. This uncertainty exposes government practice to engineering and management challenges (cf. e.g. [5]), and applications have yet to deliver sustainable and reproducible results in the government domain.

Three noteworthy challenges are accountability [6], data sharing (cf. e.g. [7, 8]) and privacy preservation (cf. e.g. [9]). Accountability represents a relation in which a government is accountable for ML to the users of its ML-based services by providing means of transparency as well as mechanisms that allow them to exercise control. For instance, users should be able to resolve responsibilities in case of data and model bias (cf. e.g. [10]), to request changes to existing policies (cf. e.g. [11]) and to institute third-party audits (cf. e.g. [12]).

The second challenge, data sharing in government, emerges from technical issues such as interoperability and the heterogeneity of data infrastructures, but also from legal constraints and organizational resistance. For instance, in governments where jurisdiction is split along federal levels and departmental competencies, the lack of data sharing is often due to a missing legal basis and corresponding administrative procedures. Moreover, a common information model is often needed (cf. e.g. [13, 14]). The third challenge, preserving privacy, is a cornerstone of digitization and machine learning and of crucial importance for governments. For instance, the GDPR in the EU and the CCPA as well as HIPAA in the USA are legal frameworks that introduce extensive requirements for government information systems.

The objective of this paper is to provide an analysis framework for addressing the described engineering and management challenges in government based on an approach named Accountable Federated Machine Learning (AFML). In this context, accountability is focused on creating verifiable claims [15] towards trustworthy engineering of machine learning [12]. Federated Machine Learning (FML) is a novel approach that applies machine learning to generate knowledge based on shared models while keeping the data private at each participating party’s side during the training process (cf. e.g. [16]). FML thus allows parties to collaborate on one of the main challenges of machine learning: the quality and quantity of training data.

With the analysis framework based on AFML, we address the question: What engineering and managerial aspects should be considered when introducing novel ML approaches in the government domain? We argue that standardization artefacts (e.g. business processes, models, shared terminologies, software tools) are required in the course of such an analysis. To support this argument, we present findings from a prototype setup of AFML within a use case of citizen participation in Germany and discuss their implications. We believe that our research should be of value to both researchers and practitioners, given the current progress in similar domains.

2 Theoretical Background

2.1 Federated Machine Learning

The term federated learning was recently introduced by McMahan et al. [16]: “We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices (which we refer to as clients) which are coordinated by a central server.” The real-world challenge addressed was primarily to learn from millions of devices (e.g. smart phones) by federating models [17]. Since then, interest in the research community has evolved to include data silos across organizations rather than across single end-users. Reflecting this development, a broader definition was introduced:

“Federated learning is a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider. Each client’s raw data is stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.” [17].

In this paper, we chose to use the term federated machine learning (FML) instead of federated learning in order to avoid confusion among researchers outside the machine learning community. Accordingly, we refer to the clients as FML parties and to the service provider as the FML aggregator.
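To make this setting concrete, the following minimal sketch shows federated averaging in the spirit of [16]: each FML party computes a local update on its private data, and the FML aggregator combines the updates weighted by sample count. It is a plain NumPy illustration under simplifying assumptions (a linear model, invented function names such as local_update), not the interface of an actual FML framework.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """One local training step per FML party; a linear-regression
    gradient step stands in for an arbitrary model (assumption)."""
    X, y = local_data
    grad = X.T @ (X @ global_weights - y) / len(y)  # gradient of MSE loss
    return global_weights - lr * grad

def aggregate(updates, sample_counts):
    """FML aggregator: average of party updates, weighted by data size."""
    total = sum(sample_counts)
    return sum(w * (n / total) for w, n in zip(updates, sample_counts))

# Three parties with private data; only model weights leave each party.
rng = np.random.default_rng(0)
parties = [(rng.normal(size=(100, 3)), rng.normal(size=100)) for _ in range(3)]
weights = np.zeros(3)
for _ in range(10):  # training rounds coordinated by the aggregator
    updates = [local_update(weights, data) for data in parties]
    weights = aggregate(updates, [len(y) for _, y in parties])
```

Note that the raw data never leaves a party; only model updates are communicated, which is the defining property of the definition quoted above.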

2.2 Accountability

Accountability can be generally defined as “a relationship between an actor and a forum, in which the actor has an obligation to explain and to justify his or her conduct, the forum can pose questions and pass judgment, and the actor may face consequences” [18]. Accountability has been considered a guiding principle for system design since the 1960s (cf. e.g. [19]). In this paper, accountability is focused on creating verifiable claims [15] towards trustworthy FML [12]. In accordance with [20], trustworthiness from an engineering perspective can be represented as an argument that aims at explaining the design of a system. Moreover, the argument should explain the checks and tests performed during development to ensure particular system properties. The argument should be organized around particular claims (or goals) and supporting evidence about the system. One can visualize it as a tree broken down into claims and subclaims (the interior nodes carry the reasoning), with evidence at the leaves. To make the claims verifiable, they should be formalized, and their evolution (as well as that of the supporting evidence) should allow for an audit trail (cf. e.g. [12, 21,22,23]).
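Such an argument tree can be illustrated as a simple data structure; in the following Python sketch the claim texts and evidence labels are hypothetical, and formalization as well as audit trails are abstracted away.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """A node in the argument tree: a claim supported either by
    subclaims (interior reasoning) or by evidence (the leaves)."""
    text: str
    subclaims: list = field(default_factory=list)
    evidence: list = field(default_factory=list)

root = Claim(
    "The trained model is trustworthy",
    subclaims=[
        Claim("The agreed training data was used",
              evidence=["hash of the training data set", "party signatures"]),
        Claim("The agreed model quality was reached",
              evidence=["benchmark results on the shared test set"]),
    ],
)
```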

From a management perspective, trustworthiness concerns assignable responsibilities, distributed obligations and corresponding rules and agreements (cf. e.g. [24]). Consequently, accountability can be presented in terms of: (i) compliance, concerned with ensuring that activities are in accordance with prescribed and/or agreed norms (cf. e.g. [25,26,27]), (ii) control among the parties involved in a process who influence the conditions of that process (cf. e.g. [24]) and (iii) a regulatory framework that defines the requirements, goals and completion criteria of a process in a manner satisfactory to the parties, who can then build a consensus regarding their judgement on the completion of the process in a verifiable way (cf. e.g. [24, 25]).

2.3 Standardization

Standardization artefacts, representing among other things a uniform set of agreements or specifications between the actors who develop and apply them, are a suitable approach to study the applicability of a technology in government [28]. In order to analyze IT standardization artefacts in government, a framework consisting of two dimensions can be applied [29, 30]: the first dimension comprises three levels of interoperability, the second five functional views. The interoperability dimension is structured along three layers. First, the business processes applied in delivering public services belong to the organizational layer. Second, the exchange of information and data between the parties involved, as well as their meaning, belongs to the semantic layer. Third, data structures and formats, the sending and receiving of data via communication protocols, electronic mechanisms to store data, as well as software and hardware are situated on the technical/syntactic layer.

The second dimension comprises five functional views. First, the administration view includes predominantly non-technical standards; they affect personnel and process aspects as well as communication within or between public administrations. Second, the modeling view includes reference models and architectures as well as modelling languages for each interoperability level. Third, standards that focus on the computation of data belong to the processing view. Fourth, standards for data and information exchange between different public administrations are covered by the communication and interaction view. Fifth, the security and privacy view contains standards that address issues such as the definition of access management policies, cryptographic methods or requesting only a required minimum of personal data.

3 Research Approach

We follow a qualitative, explorative research approach. We aim at developing a descriptive artefact (an analysis framework) that can be categorized as a theory for analyzing [31]. Our research approach is rooted in the paradigm of pragmatism [32]. We studied the findings through an argumentative-deductive analysis [33], which comprised theoretically founded concept development and prototype development. We conducted a hermeneutic literature review [30] to study the theoretical foundations. Thereby, we developed our understanding of the concepts of accountability [19, 34, 35] and FML [16, 17, 36, 37] and derived implications from a standardization perspective [28, 29].

For our prototype, we used data and a list of challenges from a research project on online citizen participation [28], where missing data showed the potential of data sharing while at the same time data privacy concerns hampered any progress on sharing. Online citizen participation can be described as a form of participation that is based on the usage of information and communication technology in societal democratic and consultative processes focused on citizens [38].

The envisioned application of machine learning with particular relevance for online citizen participation is natural language processing (NLP). NLP has already been applied in government practice (e.g. [39]) as well as in online citizen participation (e.g. [40]). Given the methodological maturity and practical tool availability of NLP, we decided to create a prototype architecture that involves the classification of ideas sampled during online citizen participation and focuses on the following challenge: how can FML be applied to NLP classification tasks that are traditionally performed in a centralized manner?

The data used for NLP in the prototype were collected during several citizen participation sessions in a German city. The dataset is a collection of citizen ideas, describing problems, concerns and suggestions in German text form, together with further information such as title and category. Since the data originated from heterogeneous sources, several preprocessing steps were conducted to obtain a consolidated dataset: a collection of 3,903 ideas across 8 categories, split with stratified sampling into a training set (90%, 3,512 ideas) and a test set (10%, 391 ideas); the training set was then randomly split into three slices of roughly 1,170 ideas each, simulating the three participating cities.
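The preprocessing can be retraced along the following lines; this is a sketch using scikit-learn, in which the placeholder data, names and random seeds are our assumptions rather than the actual pipeline of the prototype.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Consolidated dataset: 3,903 ideas across 8 categories (placeholders).
ideas = np.array([f"idea text {i}" for i in range(3903)])
categories = np.random.default_rng(0).integers(0, 8, size=3903)

# 90/10 stratified split preserves the category distribution.
X_train, X_test, y_train, y_test = train_test_split(
    ideas, categories, test_size=0.1, stratify=categories, random_state=0)

# Random split of the 3,512 training ideas into three slices of
# roughly 1,170 ideas each, one per simulated city (FML party).
perm = np.random.default_rng(1).permutation(len(X_train))
party_data = [(X_train[idx], y_train[idx]) for idx in np.array_split(perm, 3)]
```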

The prototype implements the use case as follows. Each city is represented by a party in the FML. The parties agree that an aggregator (an independent actor responsible for providing the required technology) manages the FML process and generates a global model based on local model updates from each party. Moreover, they agree on the set of test data used to benchmark each updated version of the model. Each party trains the model locally over numerous rounds and stores the required data locally. Technical components such as messaging and routing of the FML communication, storage of encrypted models, as well as accountability-related rules and log data (verifiable claims and evidence about the training data set, number of rounds, model updates, reached model quality, configuration of each party etc.) are hosted on a cloud infrastructure. For running and orchestrating the FML process, we used the IBM FL framework [36]. The component for accountability rules and data (cf. https://git.fortiss.org/evidentia for details) is built atop a Datalog-based logic database [41] and Hyperledger Fabric [42].
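To illustrate the accountability-related log data mentioned above, the following sketch shows what a per-round evidence record might look like; the field names and hashing scheme are our assumptions and do not reproduce the actual schema of the evidentia component.

```python
import hashlib
import json
import time

def evidence_record(round_no, party_id, model_update_bytes, test_accuracy):
    """One log entry per party and round; hashing keeps the model
    content private while still making the update verifiable."""
    return {
        "round": round_no,
        "party": party_id,
        "model_update_sha256": hashlib.sha256(model_update_bytes).hexdigest(),
        "test_accuracy": test_accuracy,
        "timestamp": time.time(),
    }

entry = evidence_record(1, "city-a", b"<serialized model weights>", 0.81)
print(json.dumps(entry, indent=2))
```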

4 AFML Engineering

4.1 Feasibility Evaluation for FML

In order to structure FML approaches in a comprehensive manner, we provide the following overview (cf. Table 1) based on substantial extant research [37, 43, 44]. The overview comprises nine dimensions with corresponding characteristics. It is selective and does not claim to be exhaustive, since its sole purpose is to provide a starting point for evaluating the feasibility of engineering FML in the government context (for additional details cf. e.g. [45]).

The dimension data partitioning characterizes the type of learning implied in terms of samples (the data to learn from) and the feature space (the individual measurable properties or characteristics of the data) to consider when labeling data. Horizontal partitioning implies that the data samples differ between the FML parties, but the feature space is the same. For example, departments with the same jurisdiction at different federal levels or in different cities share knowledge about user preferences regarding the consumption of a public service. Vertical partitioning implies that the features differ between the FML parties, but the sample space is the same. For example, departments with different jurisdictions share knowledge about relevant user attributes to create a more detailed user profile. Hybrid approaches concern differences in both samples and features; they are of interest in academic research but have limited practical implications.
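The following toy example (in Python with pandas; all column and party names are illustrative) contrasts horizontal and vertical partitioning.

```python
import pandas as pd

citizens = pd.DataFrame({
    "citizen_id": [1, 2, 3, 4],
    "age": [34, 51, 29, 62],
    "service_usage": [5, 2, 7, 1],
})

# Horizontal: same features, different samples (e.g. two cities).
city_a = citizens.iloc[:2]   # citizens 1 and 2
city_b = citizens.iloc[2:]   # citizens 3 and 4

# Vertical: same samples, different features (e.g. two departments).
dept_x = citizens[["citizen_id", "age"]]
dept_y = citizens[["citizen_id", "service_usage"]]
```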

Table 1. Feasibility evaluation of FML.

The machine learning model dimension describes the different types of approaches that can be applied in FML. It is important that a profound understanding of the mechanics of these approaches is present at each party’s side, since the parties should agree on one approach and be able to anticipate potential attacks and malfunctions (cf. e.g. [17]).

The dimensions training data input and output as well as data federation are of particular relevance for FML in a government setup where the parties are organizations. This is because cross-device federation typically addresses the challenge of applying FML to a large number of end devices (smart phones, IoT devices etc.), where challenges emerge from scalability issues but the variety of data types is rather limited. In cross-silo federation, different organizations participate with different data. For instance, different public administrations have to agree on which data to use for building models and for which purpose. In this case, the involved parties should agree on the characteristics of data input and output; in case of heterogeneity, additional data pre- and post-processing should be introduced (cf. e.g. [44]).

Privacy preservation is typically addressed either by differential privacy or by cryptographic methods [12]. Essentially, differential privacy aims at handling data in a way that does not allow privacy-relevant information to be reverse engineered from the model or from queries against the model. In practice this is often challenging due to the tension between robustness, fairness and privacy (cf. e.g. [17]). Cryptographic methods focus on ensuring that data stays private throughout the FML process, i.e. that computation is performed without revealing privacy-related information. Practical and theoretical challenges include analyzing potential attacks as well as reaching the desired level of performance, since the communication and processing overhead is high.
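The differential-privacy route can be sketched as follows: a party clips its model update and adds Gaussian noise before sharing it with the aggregator. The parameter values are illustrative assumptions, not a calibrated privacy mechanism.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, seed=None):
    """Clip a party's model update and add Gaussian noise before it
    leaves the party, limiting what an observer can reverse engineer."""
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

noisy = privatize_update(np.array([0.4, -1.7, 0.9]))
```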

The network topology can be centralized (e.g. with an aggregator) or decentralized (e.g. the model is aggregated among the parties in a peer-to-peer setup). Practical implementations vary based on the type of data partitioning involved. A rather common choice is a centralized topology combined with suitable privacy preservation.

Federation need represents a generic engineering dimension that addresses alignment with business and management. Economic incentives imply a clear efficiency or effectiveness need of the parties to collaboratively develop a model with corresponding privacy requirements. Needs that result from regulation are often derived from the legal framework and its limitations regarding data sharing. Federation can also arise from a mixture of needs, e.g. initial economic drivers combined with consideration of the GDPR and jurisdictional legal norms among public administrations.

Technology grade reflects the practical need to analyze and select a suitable technology stack. While research is active and ongoing, industry frameworks and service offerings are available that take care of non-FML-specific tasks such as authentication and identity management (cf. e.g. [17, 46]).

4.2 Architecture of AFML

Based on our prototype, we generalize the following architecture (cf. Fig. 1) for AFML and extend the feasibility evaluation for FML with a focus on practical implications. First, parties in the government domain should be capable of applying machine learning. AFML does not provide a remedy for missing machine learning pipelines; in fact, it only adjusts the way machine learning is applied. Moreover, data preprocessing remains a traditional challenge to be mastered, and in the context of AFML this might entail additional effort when defending against attacks and malfunctions (cf. e.g. [17]).

Fig. 1. A practical AFML architecture.

Second, in the context of government practice, particular FML setups seem more feasible. We argue that data partitioning would be rather horizontal, data federation cross-silo, the network topology rather centralized, and the considered technologies should be of industry or near-industry grade. With the prototype, we gathered initial insights in support of our argument: the cities were willing to experiment but pointed out that a service provider should take over the FML setup, and cross-device federation currently lacks the corresponding infrastructure and applications.

Based on our prototype and concept development, we also found that the accountability of FML represents a potential enabler for the acceptance of a novel technology. This is the case since fairness, ethics and privacy are intensively discussed in the context of machine learning (cf. e.g. [3]) and, ultimately, of trustworthy machine learning (cf. e.g. [12]), which at times keeps public administrations from acting.

From an engineering point of view, this challenge can be addressed with verifiable claims. Such claims should be defined regarding the architecture, data and data processing of FML. Examples include the framework configuration, data samples and features, model configuration and lineage along the training rounds and, of course, the continuous integration of changes (cf. e.g. [47]). Moreover, verifiable claims should be built upon corresponding authentication and ID management as well as tamper-proof logs (cf. e.g. [21,22,23, 48]). Consequently, the additional feasibility requirement is to introduce accountability and to make it accessible to a broad range of stakeholders through claims reports.
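The principle of tamper-proof logs can be conveyed with hash chaining, as sketched below; the prototype relies on Hyperledger Fabric for this purpose, so the sketch is only an illustration with hypothetical entries.

```python
import hashlib
import json

def append_entry(log, entry):
    """Chain each entry to its predecessor's hash; altering any past
    entry invalidates the hashes of all subsequent entries."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps({"prev": prev_hash, "entry": entry}, sort_keys=True)
    log.append({"entry": entry, "prev": prev_hash,
                "hash": hashlib.sha256(payload.encode()).hexdigest()})

log = []
append_entry(log, {"claim": "round 1 completed", "evidence": "sha256:..."})
append_entry(log, {"claim": "model quality reached", "evidence": "sha256:..."})
```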

5 AFML Management

Managing accountability for FML comprises the transparent assignment and ownership of responsibilities based on rules and agreements about the expected results and obligations, making it possible to judge whether all parties have fulfilled their responsibilities. Moreover, it might comprise mechanisms to impose sanctions if obligations are not fulfilled, thereby enabling FML goals to be distributed across multiple organizations.

5.1 Actors in AFML

In order to resolve particular management challenges, it is important to study which actors are involved in FML besides parties and aggregators. Based on our analysis, we distinguish between the following actors [47]: supplier (also referred to as producer), deployer (consumer), aggregator, party and auditor (cf. Fig. 2).

The supplier is the entity that owns the process of training an FML model. The supplier is responsible for prescribing the global training parameters, the model architecture, the FML protocol and fusion algorithm, and specific data handlers. The supplier owns the trained FML model and provides it (e.g. through software licensing) to the deployer.

The deployer is the entity that controls the usage, risks and benefits of the trained FML model. The aggregator is responsible for aggregating the FML model updates provided by the parties, adhering to the relevant supplier’s prescriptions, and for making the trained FML model accessible to the supplier. The parties are responsible for providing the aggregator with FML model updates, adhering to the relevant supplier’s prescriptions. The auditor is an independent (potentially accredited) body that verifies and/or certifies that the deployment of the FML model (and/or the FML model itself) adheres to technical standards and/or applicable governance, risk and compliance (GRC) obligations.

Fig. 2. Overview of actors in AFML.

5.2 Trust Between Parties

Another aspect of addressing the management challenges of AFML is the mapping of rules and agreements between the parties according to their responsibilities and obligations. Since the parties are different organizations, from a management perspective FML activities take place in a setup where different organizations carry out activities jointly to achieve a common business objective. In particular, the compliance challenges of linking regulation with governance, business processes and their execution across organizations pose threats to coping with responsibilities and obligations [27]. To cope with the latter, a corresponding level of trust is required. Achieving trust is challenging in the case of FML due to potentially unequal power relations (e.g. who provides the data, who is interested in the model) and the possibility that individual organizations protect their own interests (e.g. what constitutes a “good model”) by manipulating and controlling the collaborative learning process (e.g. attacks and/or malfunction).

As a remedy, accountability in the management of FML would include introducing verifiable claims that link (i) compliance with prescribed and/or agreed norms (cf. e.g. [25,26,27]), (ii) control mechanisms for the parties involved in a process (cf. e.g. [24, 49]) and (iii) a regulatory framework. Such claims should represent the basis for operationalizing trust among the actors in a satisfactory manner. In particular, this would result in a non-repudiable consensus regarding the actors’ judgement on the completion of the process in a verifiable way (cf. e.g. [24, 25]).

Claim reports could represent a feasible artefact for operationalizing accountability. The purpose of such reports (sometimes referred to as factsheets [47, 50]) is to provide transparency and instill trust in ML services. They are to be “completed by AI service providers for examination by consumers” and shall contain “sections on all relevant attributes of an AI service”, in particular “how the service was created, trained, and deployed” [47]. Yet it remains an open issue what level of detail and which claims are most suitable for a report to achieve a required level of trust.
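For illustration, a claims report could be rendered from the logged claims roughly as follows; the structure loosely follows the factsheet idea in [47], and all section names and values are hypothetical.

```python
claims_report = {
    "service": "Idea classification for online citizen participation",
    "created": {"parties": ["city-a", "city-b", "city-c"],
                "aggregator": "independent technology provider"},
    "trained": {"rounds": 10, "training_data": "sha256:...",
                "privacy": "clipped updates with Gaussian noise"},
    "deployed": {"model_version": "sha256:...",
                 "test_accuracy": 0.81},
}
```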

6 Discussion and Conclusion

In this paper, we formulated the need to face the challenges of accountability, data sharing and privacy preservation when applying machine learning in a government context. We address this need by introducing a novel approach, federated machine learning, which relaxes the limitations imposed by data sharing and privacy constraints. Moreover, we address it by introducing accountability from an engineering and a management perspective, geared towards generating verifiable claims for FML.

Through an argumentative-deductive analysis of the literature and a prototype of AFML for online citizen participation, we explored AFML from an engineering and a management perspective. The engineering perspective includes a feasibility evaluation of FML and adds an accountability dimension based on a corresponding architecture for practical applications in government. The management perspective includes an analysis of the actors involved in AFML and of means to establish trust among them.

Based on this analysis framework, we approach the question of introducing AFML in the government domain through the following overview of standardization artefacts (cf. Table 2, [7, 28]). Exemplary artefacts in bold indicate that substantial progress has already been made or that industry-ready solutions already exist, which is promising for exploring improvements of existing approaches (cf. e.g. [39]). For the other exemplary artefacts, the status is either an open research problem or current solutions are suited only to research setups.

Table 2. Implications for streamlining AFML in government.

Our research has a number of limitations. First, the engineering analysis is rather general and omits details that might be of relevance for a thorough feasibility evaluation in practice, especially from a methodological perspective, given that we developed the prototype ourselves and used it as a basis for interpretation. Second, the presented architecture focuses on cross-silo data federation. Emerging developments (e.g. smart cities, edge computing) might create the need for cross-device FML in government, which is out of the scope of our research. Third, the management analysis was based solely on argumentation and deduction from relevant literature as well as on prototype development, due to the novelty of FML and the limited access to suitable interviewees in the government domain for collecting primary empirical data. Fourth, a limitation of ML in general is a possible security and privacy breach through reverse engineering of a model, which might leak the underlying data.

We believe that future research should build on our findings and address the described limitations. We strongly encourage researchers to explore potential use cases and to derive engineering and management requirements for AFML. We also believe that practitioners can directly benefit from the presented findings and apply them as a basis for exploring novel FML techniques to overcome traditional challenges.