Today's organizations are highly dynamic, with distributed workforces and applications. As these companies grow, their IT teams must perform the increasingly complex task of user access management.

Security professionals define access control policies and controls that mandate user access review on a periodic basis. The user access list is sent to application owners for review, and user IDs are disabled or deleted as necessary. The user access review control acts as a backstop: it helps ensure that unauthorized users do not continue to exist in the system when a user ID is not deleted during the normal offboarding process.

However, if user access reviews have not been carried out diligently or organizational changes impede access management, the user ID deletion process becomes quite complex. For instance, if a large multinational corporation has been growing through mergers and acquisitions, the integration of user access systems is complicated and cumbersome. Statutory audits in these organizations often reveal noncompliance related to user access management.

The Tedium of Manual User Access Review

Organizations set up projects to clean up unauthorized user IDs in the system, and the initiative is typically driven by regulatory requirements or the company’s security policy. Due to its manual nature, this process requires a concerted effort and ample resources.

Such a project typically involves the following steps:

  1. Generate a user ID report from the relevant systems.
  2. Send reports to application and system owners for review.
  3. Set up meetings with owners and stakeholders.
  4. Disable or delete user IDs based on feedback from owners and stakeholders.

For large organizations with many systems and applications, and potentially thousands of users, the user ID cleanup process is very tedious. A machine learning algorithm can help security teams streamline this process to make decisions more efficiently.

Using Machine Learning for Faster Decision-Making

User ID cleanup projects are effort-intensive and require complex interactions between various stakeholders, increasing the cost of performing these activities. Administrators have to interact with many different stakeholders to determine which user IDs should be deleted. Often, application or system owners fail to provide input at the right time, leaving the organization exposed to potentially unauthorized users. This also contributes to noncompliance with regulations and security policies.

Machine learning algorithms can help security teams model user ID disabling activities. These algorithms do not solve the complex problems related to user access review, but they aid in decision-making, which is a critical need in such projects.

The process begins with the identification of user ID environments for model building. User ID information is generated in the form of raw data, which is then prepared and normalized for logistic regression. Finally, a model is developed and results are analyzed accordingly. The graphic below illustrates the steps of this process:

Figure 1: The user ID cleanup process

Preparing for Model Building

Organizations can have thousands of user IDs across heterogeneous environments, so cleanup projects may require building many models to reflect that. For instance, user IDs in development, production and test environments might each fit into different models. However, the analysis performed in the example below considered development, production and test environments in the same model. The best set of models should be determined based on a number of factors, such as organizational unit, geography, type of application and type of environment.

Generating Raw Data

Raw data is generated based on the agreed-upon model and is usually maintained in spreadsheets or exported through a central user ID management system. The raw data, which can include details related to users, servers and applications, will depend on the models selected during the previous stage.

For our test, raw data is generated for about 3,000 user IDs and sanitized for logistic regression. This data includes the details outlined below:

| Area | Details |
| --- | --- |
| User | Account type (shared, privileged, third-party, etc.) |
| Server | Operating system |
| Application | Application name |
| System environment | Production server, test server, development server |

Table 1: Raw data
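As a minimal sketch, assuming the raw export lands in a CSV with one row per user ID (the file and column names here are hypothetical, not from the original study), the data might be loaded for inspection as follows:

```python
import pandas as pd

# Hypothetical export from a central user ID management system covering the
# areas in table 1: account type, operating system, application, environment.
raw = pd.read_csv("user_ids_raw.csv")

print(raw.shape)                          # roughly (3000, 5) in our test
print(raw["environment"].value_counts())  # production / test / development mix
```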

Sampling

A random sample of 348 user IDs is taken to build the model, which is then applied to the rest of the user IDs. The sample user IDs are discussed with application and server owners to determine whether they should be deleted, and the outcome is recorded in a new column labeled “ID to Be Disabled,” which serves as the dependent variable.
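A sketch of the sampling step, building on the frame above (the random seed is an arbitrary choice for repeatability):

```python
# Draw the 348-ID sample used for model building.
sample = raw.sample(n=348, random_state=42)

# After review with application and server owners, record the dependent
# variable: 1 if the ID should be disabled, 0 otherwise. The values come
# from owner feedback, not from the data itself.
sample["id_to_be_disabled"] = 0  # placeholder until owner decisions are entered
```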

Data Preparation and Variable Selection

The information gleaned from the application and server owners enables the security team to select the independent variables. For our test, we identified the following variables:

| Area | Variable | Number of Variables |
| --- | --- | --- |
| User | Shared, privilege | 2 variables |
| Server | Operating systems | 5 variables |
| Application | Application type | 11 variables |
| Environment | Production, development and test | 3 variables |
| Total Variables | | 21 variables |

Table 2: Variables

The variable identification activity is also effort-intensive. A high number of variables suggests that many different environments could exist, which may require more than one model for decision-making.

The variables outlined in table 2 are categorical text values and must be converted into binary indicator (dummy) variables before they can be used in logistic regression.
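A sketch of the dummy-coding step with pandas, assuming the hypothetical column names used above; each categorical column expands into the 0/1 indicator variables counted in table 2:

```python
# Convert the categorical columns into 21 binary indicator variables.
X = pd.get_dummies(
    sample[["account_type", "operating_system", "application", "environment"]],
    dtype=int,
)
y = sample["id_to_be_disabled"]
```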

Results and Analysis

The prepared data is fed into logistic regression to obtain the following model with 21 variables:

Z = -1.611 + 23.245*Privilege + 23.298*Shared - 21.767*OS1 - 20.092*OS2 - 17.621*OS3 - 21.089*OS4 - 21.027*OS5 - 23.029*APP1 - 1.317*APP2 + 0.148*APP3 - 2.742*APP4 - 20.417*APP5 - 24.350*APP6 + 1.383*APP7 - 20.627*APP8 + 23.187*APP9 - 4.832*APP10 + 21.445*APP11 - 0.484*Production + 0.233*Development - 3.412*Test

The probability of disabling the user ID is calculated using the following formula:

Probability of disabling user ID = exp(Z) / (1 + exp(Z))

The model is then applied to the rest of the user IDs in the organization to determine the probability of disabling each one. At this point, decision criteria are required to help the security team determine how to take action.
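As a rough sketch of the fit-and-score steps, assuming the X and y frames from the dummy-coding sketch above (statsmodels is an illustrative choice here, not necessarily the tool used in the original analysis; its summary output mirrors the B, S.E. and Sig. columns of table 4):

```python
import statsmodels.api as sm

# Fit the logistic regression on the 348-ID sample. Note: the very large
# coefficients and standard errors in table 4 suggest quasi-separation in
# the data, so fit() may emit convergence warnings on similar data.
model = sm.Logit(y, sm.add_constant(X)).fit()
print(model.summary())  # coefficients (B), standard errors (S.E.), p-values (Sig.)

# Score the remaining user IDs; predict() returns exp(Z) / (1 + exp(Z)).
rest = raw.drop(sample.index).copy()
X_rest = pd.get_dummies(
    rest[["account_type", "operating_system", "application", "environment"]],
    dtype=int,
).reindex(columns=X.columns, fill_value=0)
rest["p_disable"] = model.predict(sm.add_constant(X_rest))
```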

Accuracy of Prediction

The model has an accuracy rate of 86.2 percent at a cutoff of 40 percent, as shown in table 3 below. The model predicted 300 out of 348 user IDs accurately.

Table 3: Model accuracy
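A minimal check of the cutoff logic, assuming the fitted model and sample from the sketches above, might look like this:

```python
# Classify sample IDs as "disable" when the predicted probability meets
# the 40 percent cutoff, then compare against the owners' decisions.
p_sample = model.predict(sm.add_constant(X))
predicted = (p_sample >= 0.40).astype(int)
accuracy = (predicted == y).mean()
print(f"Accuracy at 40% cutoff: {accuracy:.1%}")  # 86.2% (300/348) in this test
```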

Variable Significance

Table 4 shows the significance of the different variables. Of the 21 variables, only four (APP4, APP7, APP10 and Test) are significant to the model.

| Variable | B | S.E. | Wald | df | Sig. | Exp(B) |
| --- | --- | --- | --- | --- | --- | --- |
| Privilege | 23.245 | 40190.4 | 0 | 1 | 1 | 1.2452E+10 |
| Shared | 23.298 | 40190.4 | 0 | 1 | 1 | 1.3126E+10 |
| OS1 | -21.767 | 40190.4 | 0 | 1 | 1 | 0 |
| OS2 | -20.092 | 40190.4 | 0 | 1 | 1 | 0 |
| OS3 | -17.621 | 40190.4 | 0 | 1 | 1 | 0 |
| OS4 | -21.089 | 40190.4 | 0 | 1 | 1 | 0 |
| OS5 | -21.027 | 40190.4 | 0 | 1 | 1 | 0 |
| APP1 | -23.029 | 3807.659 | 0 | 1 | 0.995 | 0 |
| APP2 | 1.317 | 1.337 | 0.97 | 1 | 0.325 | 3.732 |
| APP3 | 0.148 | 0.671 | 0.048 | 1 | 0.826 | 1.159 |
| APP4 | -2.742 | 1.252 | 4.797 | 1 | 0.029 | 0.064 |
| APP5 | -20.417 | 10197.04 | 0 | 1 | 0.998 | 0 |
| APP6 | -24.35 | 9738.721 | 0 | 1 | 0.998 | 0 |
| APP7 | 1.383 | 0.677 | 4.174 | 1 | 0.041 | 3.987 |
| APP8 | -20.627 | 10741.68 | 0 | 1 | 0.998 | 0 |
| APP9 | 23.187 | 10280.24 | 0 | 1 | 0.998 | 1.175E+10 |
| APP10 | -4.832 | 1.764 | 7.508 | 1 | 0.006 | 0.008 |
| APP11 | 21.445 | 13359.75 | 0 | 1 | 0.999 | 2057356196 |
| Production | -0.484 | 0.667 | 0.527 | 1 | 0.468 | 0.616 |
| Development | 0.233 | 0.892 | 0.068 | 1 | 0.794 | 1.262 |
| Test | -3.412 | 1.768 | 3.725 | 1 | 0.054 | 0.033 |
| Constant | -1.611 | 1.326 | 1.476 | 1 | 0.224 | 0.2 |

Table 4: Variable significance

Further Analysis

A high number of variables suggests that many different environments exist for model building, meaning that more models must be developed to improve decision-making. In our test with 21 variables, we could have considered a number of different models.

The combination of the different variables and the selection of the variables must be reviewed each time new models are built. Depending on the environment, different models can suggest different independent variables that are significant, which impacts the accuracy of predicting the disabling of user IDs. Therefore, the model building activity is an iterative process.

Decision-Making

The models should help security teams make better decisions, which can help reduce user ID review efforts significantly. For the test described in this article, we can create the following criteria to enable decision-making:

  1. If the environment is not important and the accuracy of prediction is greater than 80 percent, the application owner can decide to delete the user ID and handle any exceptions if affected users respond.
  2. If the environment is important, the application owner should delete the user ID only if the accuracy of the prediction is greater than 95 percent.
  3. If the environment is critical, application owners might decide to use manual methods and disregard the information provided by the models.

These three criteria help security teams determine how to take action without having to review all user IDs individually with the server and application owners.
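As an illustration only, the three criteria might be encoded as follows, interpreting the accuracy of the prediction as the model's predicted probability and assuming a hypothetical criticality label per environment:

```python
# Hypothetical encoding of the three decision rules; the thresholds mirror
# the article, while the criticality labels and column names are assumptions.
def decide(row):
    if row["criticality"] == "critical":
        return "manual review"  # rule 3: disregard the model
    if row["criticality"] == "important":
        return "delete" if row["p_disable"] > 0.95 else "manual review"  # rule 2
    return "delete" if row["p_disable"] > 0.80 else "keep"  # rule 1

rest["action"] = rest.apply(decide, axis=1)
```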

Limitations

The user IDs mentioned in this article refer to end-user IDs. A user ID can be linked to many accounts, and deleting an ID can impact other relevant access privileges for the user. This approach has risks, so careful planning and decision-making are crucial to reduce impact. The context provided here does not apply to all situations, and if you have effectively deployed user access review and management controls, such a complex approach is not required.

User Access Review Modeling Is a Means to an End

The model developed through this process is a means to an end, not an end in itself. It must be analyzed and tested, both statistically and functionally, to determine its relevance to the environment. Variable selection is the most critical component of this process, and the model will vary from environment to environment. Still, machine learning can help any organization stretch its resources and improve decision-making during the otherwise arduous user ID cleanup process.
