The importance given to Corporate Governance has grown in recent years to include ESG guidelines (Environmental, Social and Governance). Translated into an AI Strategy, this means a Data Governance focused on addressing the biases and noise present in real-world data, so that the results generated represent the diversity and variety of society as a whole, mitigating and reducing the possibility of negative impacts on people's lives.
Over time, AI models "detach" from the reality they were built to represent, a process known as "AI drift". It happens due to a series of factors, such as changes in the context and in the variables that weigh on a given result. The world, in this sense, is fluid, and algorithms and models must adapt to this reality.
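One common way to monitor this kind of drift is the Population Stability Index (PSI), which compares a feature's distribution at training time with its distribution in production. The sketch below is illustrative: the data is synthetic, the decile binning and the usual PSI rule-of-thumb thresholds are assumptions, not a prescription.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a recent sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover the full range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)           # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)   # feature distribution at training time
live = rng.normal(0.8, 1.3, 5000)    # shifted production data
psi = population_stability_index(train, live)
print(f"PSI = {psi:.3f}")            # above 0.25 signals major drift: review or retrain
```

Running such a check periodically on key model inputs gives an early, quantitative warning that the model is detaching from reality, before errors show up in business results.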
Data Governance is essential to clarify what data is used in the algorithms and to guarantee its quality, integrity and veracity, as well as to avoid and mitigate bias and errors at every stage.
Achieving this depends on the quality of the training data and on the variables chosen, and a good practice is to adopt a Data Governance process with human supervision.
The reality is that eliminating bias from algorithms is a difficult task, as the data is drawn from a complex, chaotic real world populated by human beings who have their own cognitive biases.
In Computer Science there is also the concept of "GIGO" (Garbage In, Garbage Out): the quality of the input directly determines the quality of the output. In other words, no matter how good your model is, it will only be as good as the data available.
That is why a strong emphasis on the Data Quality stage is essential to ensure that machine learning models are reliable and can be used to guide business decisions.
For an algorithm to be "explainable" and "interpretable", all the stages of the machine learning process that led to a prediction must be traceable, and the variables that weighed on the decision must be open to scrutiny. This is still a weakness of so-called "Deep Learning" models, which lack these characteristics when compared to traditional models.
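For a simple linear model, this kind of scrutiny is direct: each variable's contribution to a prediction can be read off and ranked. The feature names, weights and input record below are illustrative assumptions, not taken from any real model.

```python
import numpy as np

feature_names = ["income", "debt_ratio", "years_employed"]
weights = np.array([0.8, -1.5, 0.4])   # learned coefficients (assumed)
bias = -0.2
x = np.array([1.2, 0.9, 0.5])          # one (standardized) input record

contributions = weights * x            # how much each variable pushed the score
score = contributions.sum() + bias

# rank variables by the absolute size of their influence on this decision
for name, c in sorted(zip(feature_names, contributions), key=lambda t: -abs(t[1])):
    print(f"{name:>15}: {c:+.2f}")
print(f"{'final score':>15}: {score:+.2f}")
```

This transparency is exactly what deep models lack out of the box, which is why they require additional post-hoc explanation techniques before their predictions can undergo the same scrutiny.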
There is also the fact that any intervention in the input data needs to be mapped, so that these data-formatting interventions can be systematized and automated, and thus applied in a scalable way in production.
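A minimal sketch of this idea: each manual intervention becomes a named, reusable step, so the same fixes applied during exploration run automatically in production. The step names and cleaning rules are illustrative assumptions.

```python
def strip_whitespace(record):
    # intervention 1: remove stray spaces from free-text fields
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def normalize_missing(record):
    # intervention 2: unify the many ways "missing" shows up in real-world data
    return {k: (None if v in ("", "N/A", "null") else v) for k, v in record.items()}

def coerce_amount(record):
    # intervention 3: enforce a numeric type on the (assumed) amount field
    record["amount"] = float(record["amount"]) if record["amount"] is not None else 0.0
    return record

PIPELINE = [strip_whitespace, normalize_missing, coerce_amount]  # the mapped interventions

def clean(record):
    for step in PIPELINE:
        record = step(record)
    return record

raw = {"customer": "  Ana ", "amount": " 19.90 ", "segment": "N/A"}
print(clean(raw))  # {'customer': 'Ana', 'amount': 19.9, 'segment': None}
```

Because the interventions are an explicit, ordered list, they can be reviewed, versioned and audited, which is precisely what Data Governance asks of the preparation stage.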
Data Quality best practices for tangible business results
When applying Artificial Intelligence to business, I consider that one should always look for the best machine learning algorithm or model for each context of use, since each context is unique and has particularities that distinguish it from the others. But one practice is common to the most diverse projects, and can be controlled according to the effort employed and the results generated: the quality of the training data used in the models. For that, adopting the best data cleansing and normalization practices is essential.
To illustrate what I mean, it is widely known in the market that about 60% to 80% of the effort of a Data Science project is spent on the Data Quality process: the cleaning, formatting and organization of data so that it can be standardized and used in machine learning algorithms. This effort varies with each situation.
Since this activity demands a large share of an AI project's effort, adopting good practices for organizing, cleansing and normalizing data enables a stricter quality control process over the algorithms' outputs, generating tangible, high-quality results.
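As a sketch of what such practices look like in code, the pandas snippet below performs three common steps: deduplication (organization), missing-value imputation (cleansing) and min-max scaling (normalization). The column names, imputation choices and scaling rule are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 29, 51],
    "monthly_spend": [120.0, 80.0, 80.0, None, 300.0],
})

df = df.drop_duplicates(subset="customer_id")     # organization: one row per entity
df["age"] = df["age"].fillna(df["age"].median())  # cleansing: impute missing values
df["monthly_spend"] = df["monthly_spend"].fillna(0.0)

# normalization: scale to [0, 1] so features are comparable across units
spend = df["monthly_spend"]
df["spend_norm"] = (spend - spend.min()) / (spend.max() - spend.min())
print(df)
```

Each of these decisions (which rows count as duplicates, how to impute, how to scale) is exactly the kind of intervention that should be documented under a Data Governance policy.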
For this reason, special attention should be given to the Data Quality process, so that the scalability of AI can be used to improve the results generated.
Data as intangible assets
The greater the quantity, quality and diversity of the data, the greater the number of examples of real experiences available, and therefore the more complex the tasks these algorithms can learn and perform. Thus, there is a relationship between the cost of training an AI model and the degree of complexity it will be able to handle within certain contexts.
Despite the cost, once AI models are trained they are able to address increasingly complex problems, with greater flexibility in the range of tasks they can perform, and as a result, greater potential value can be generated over time. A Data Governance strategy with a periodic review process for AI models is also necessary, because changes in the context of the problem can introduce noise and errors into the outputs, resulting in additional correction costs if not prevented in a timely manner.
Thus, the organization starts to act as a platform powered by data, an exponential way of scaling operations based on algorithms.
The Collaborative Process and Data Governance in organizations
Data is an essential input to the collaborative processes of generating ideas and prototyping new products in the experimentation cycles of organizations worldwide, which use data as intangible assets in the organization's value chain to obtain scalable results.
The collaborative process must adopt a Data Governance strategy, embodied in processes that guarantee the quality and consistency of the data.
Equally important is the establishment of common definitions for the treatment of data, providing greater standardization across all the applications that use it, since this data will feed the organization's collaborative processes and its decision-making.
This process is vital for the organization, as it guarantees greater reliability, precision and, ultimately, the value to be extracted from the data.
Therefore, in an increasingly complex and uncertain world, data authentication and validation is an essential step to guarantee, in addition to the information security triad (confidentiality, integrity and availability), the data's veracity, thus avoiding "variations" of the truth and the formation of information "silos" in each area.
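One simple building block for this kind of validation is a cryptographic fingerprint of the data, which every area can recompute to confirm it is working from the same records, preventing each department from drifting toward its own "version" of the truth. The sketch below uses SHA-256 over canonicalized rows; the records themselves are illustrative assumptions.

```python
import hashlib

def dataset_digest(rows):
    """Digest over canonicalized (sorted) records, so row order and
    formatting noise do not change the fingerprint."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(row.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

published = dataset_digest(["1,Ana,120.0", "2,Bruno,80.0"])
received = dataset_digest(["2,Bruno,80.0", "1,Ana,120.0"])  # same data, different order
print(received == published)  # True: same records yield the same fingerprint
```

Publishing the digest alongside the dataset lets any consumer detect alteration or corruption: if even one value changes, the fingerprint no longer matches.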
This process should also promote the mapping of the different sources of information used, along with the consolidation and greater consistency of the data that will feed the collaborative flows that generate value in the organization.
And, whenever the organization's conditions allow, make use of a central repository (a Data Lake) that facilitates querying information across departments, avoids the rework of re-entering data, and automates whatever is possible, freeing employees' time to focus on activities with greater added value.
Industry Data Governance
Industrial organizations contain true ecosystems of information systems, structured in a complex chain of interdependent events, each with its own degree of importance and dependencies, in which an error at some stage of a process has the potential to cause major operational problems.
That is why, for everything to work in an integrated way, the different systems must have their data validated, so that no errors occur when an order is released to production; such errors would lead to a standstill, delaying the entire production and causing losses for the organization.
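A minimal sketch of such a validation gate, with hypothetical field names and rules: the order record is checked before release, so a bad record is rejected up front instead of halting the line later.

```python
# fields every production order must carry (illustrative assumption)
REQUIRED_FIELDS = {"order_id", "part_number", "quantity", "due_date"}

def validate_order(order):
    """Return a list of problems; an empty list means the order may be released."""
    errors = []
    missing = REQUIRED_FIELDS - order.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    qty = order.get("quantity")
    if not isinstance(qty, int) or qty <= 0:
        errors.append("quantity must be a positive integer")
    return errors

order = {"order_id": "PO-1001", "part_number": "AX-7", "quantity": 0}
problems = validate_order(order)
if problems:
    print("order rejected:", problems)  # block the order instead of stopping production
```

In a real plant these rules would be far richer (cross-system consistency, bill-of-materials checks, date windows), but the principle is the same: validate at the boundary between systems, before an error can propagate through the chain.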
Thus, the correct treatment of all this information, in all its variability and complexity, through Data Governance policies and standardized Data Quality and data-formatting techniques, is mandatory for the information to be considered consistent enough to serve as input for both the manufacturing process and the decision-making process, so that value can be extracted in the form of actionable insights.
Risk Management and the use of Artificial Intelligence at scale
The main obstacle to the current and future use of Artificial Intelligence at scale in society is not technological, but the difficulty of eliminating biases and errors from the results generated by the algorithms, which are trained on data that may contain these same vices at its origin. This ends up automating decisions that can impact people's lives, raising a series of ethical issues that need to be addressed.
People are generally not even aware of these facts. With the new data protection laws around the world, the Artificial Intelligence professional who wants to stand out should be able to make clear which premises and factors were taken into account to generate a machine learning algorithm's result. With the automation of decisions increasingly in vogue, there is a greater probability of impact in the "real" world, with possible real side effects. It is therefore of great importance to create a Risk Management strategy, so that these risks can be mapped, correctly measured, mitigated when necessary and, finally, corrected when identified.
In other words, the ability of a human operator to explain and interpret the results generated by the algorithm in the context of the problem, in order to avoid bias and errors, will, I believe, be a major factor for AI use to spread at scale, especially in areas that handle confidential and sensitive data, such as healthcare and government services.
The mandatory use of strict Data Governance criteria at all stages of the machine learning process that led to a prediction will help to mitigate and circumvent these problems, building confidence in the impartiality of the results generated, always with human supervision in the process.