Today's organizations are highly dynamic, with distributed workforces and applications. As these companies grow, their IT teams must perform the increasingly complex task of user access management.
Security professionals define access control policies and controls that mandate user access review on a periodic basis. The user access list is sent to application owners for review, and user IDs are disabled or deleted as necessary. The user access review control helps ensure that unauthorized users do not continue to exist in the system if a user ID is not deleted during the normal offboarding process.
However, if user access reviews have not been carried out diligently or organizational changes impede access management, the user ID deletion process becomes quite complex. For instance, if a large multinational corporation has been growing through mergers and acquisitions, the integration of user access systems is complicated and cumbersome. Statutory audits in these organizations often reveal noncompliance related to user access management.
The Tedium of Manual User Access Review
Organizations set up projects to clean up unauthorized user IDs in the system, and the initiative is driven by regulatory requirements or the company's security policy. Due to its manual nature, this process requires a concerted effort and ample resources.
Such a project typically involves the following steps:
- Generate a user ID report from the relevant systems.
- Send reports to application and system owners for review.
- Set up meetings with owners and stakeholders.
- Disable or delete user IDs based on feedback from owners and stakeholders.
For large organizations with many systems and applications, and potentially thousands of users, the user ID cleanup process is very tedious. A machine learning algorithm can help security teams streamline this process to make decisions more efficiently.
Using Machine Learning for Faster Decision-Making
User ID cleanup projects are effort-intensive and require complex interactions among various stakeholders, increasing the cost of these activities. Administrators must consult many different stakeholders to determine which user IDs should be deleted. Often, application or system owners fail to provide input at the right time, leaving potentially unauthorized user IDs active and putting the organization at risk. This also contributes to noncompliance with regulations and security policies.
Machine learning algorithms can help security teams model user ID disabling activities. These algorithms do not solve the complex problems related to user access review, but they aid in decision-making, which is a critical need in such projects.
The process begins with the identification of user ID environments for model building. User ID information is generated in the form of raw data, which is then prepared and normalized for logistic regression. Finally, a model is developed and results are analyzed accordingly. The graphic below illustrates the steps of this process:
Figure 1: The user ID cleanup process
Preparing for Model Building
Organizations can have thousands of user IDs spread across heterogeneous environments, so cleanup projects may require building many models to reflect that. For instance, user IDs in development, production and test environments might each fit into different models. However, the analysis performed in the example below considered development, production and test environments in the same model. The appropriate set of models must be determined based on a number of factors, such as organizational unit, geography, type of application and type of environment.
Generating Raw Data
Raw data is generated based on the agreed-upon model and is usually maintained in spreadsheets or exported through a central user ID management system. The raw data, which can include details related to users, servers and applications, will depend on the models selected during the previous stage.
For our test, raw data is generated for about 3,000 user IDs and sanitized for logistic regression. This data includes the details outlined below:
| Area | Details |
| --- | --- |
| User | Account type (shared, privileged, third-party, etc.) |
| Server | Operating system |
| Application | Application name |
| System environment | Production server, test server, development server |

Table 1: Raw data
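To make the later steps concrete, here is a minimal sketch of what such an extract might look like once loaded into a dataframe. The column names and values are hypothetical, not taken from any real system:

```python
import pandas as pd

# Hypothetical extract of the raw data described in table 1; the
# column names and values are illustrative placeholders.
raw = pd.DataFrame({
    "user_id":      ["u0001", "u0002", "u0003"],
    "account_type": ["shared", "privileged", "third-party"],
    "os":           ["OS1", "OS3", "OS1"],
    "application":  ["APP4", "APP7", "APP2"],
    "environment":  ["production", "test", "development"],
})
```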
Sampling
A random sample of 348 user IDs is taken to build the model, which is then applied to the rest of the user IDs. The sample user IDs are then discussed with application and server owners to determine whether they should be deleted. This introduces a new column labeled “ID to Be Disabled,” which serves as the dependent variable. The new column is populated based on discussions with the application and server owners.
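Continuing the hypothetical frame above, the sampling step and the new dependent-variable column might be set up as follows; the "ID to Be Disabled" values come from the owner discussions, not from code:

```python
# Draw a reproducible random sample of user IDs for model building.
# (The test described here used n=348; the toy frame above is smaller.)
sample = raw.sample(n=min(348, len(raw)), random_state=42)

# Dependent variable: 1 if owners confirm the ID should be disabled,
# 0 otherwise. Populated from review meetings; shown here as a stub.
sample["id_to_be_disabled"] = 0
```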
Data Preparation and Variable Selection
The information gleaned from the application and server owners enables the security team to select the independent variables. For our test, we identified the following variables:
| Area | Variable | Number of Variables |
| --- | --- | --- |
| User | Shared, privilege | 2 |
| Server | Operating systems | 5 |
| Application | Application type | 11 |
| Environment | Production, development and test | 3 |
| Total variables | | 21 |

Table 2: Variables
The variable identification activity is also effort-intensive. A high number of variables suggests that many different environments could exist, which may require more than one model for decision-making.
The variables outlined in table 2 are in text format and must be prepared for logistic regression by converting the textual information into binary form.
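One common way to do this conversion is one-hot encoding, sketched below with pandas and assuming the hypothetical `sample` frame from the earlier sketches; with the categories in table 2, this yields the 21 indicator variables:

```python
import pandas as pd

# Expand each categorical column into 0/1 indicator columns.
# Note: fitting usually requires dropping one indicator per group
# (drop_first=True) to avoid perfect collinearity with the intercept;
# the model below keeps all 21 indicators to mirror table 2.
X = pd.get_dummies(
    sample[["account_type", "os", "application", "environment"]],
    dtype=int,
)
y = sample["id_to_be_disabled"]
```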
Results and Analysis
The prepared data is fed into logistic regression to obtain the following model with 21 variables:
Z = -1.611 + 23.245*Privilege + 23.298*Shared - 21.767*OS1 - 20.092*OS2 - 17.621*OS3 - 21.089*OS4 - 21.027*OS5 - 23.029*APP1 - 1.317*APP2 + 0.148*APP3 - 2.742*APP4 - 20.417*APP5 - 24.350*APP6 + 1.383*APP7 - 20.627*APP8 + 23.187*APP9 - 4.832*APP10 + 21.445*APP11 - 0.484*Production + 0.233*Development - 3.412*Test
The probability of disabling the user ID is calculated using the following formula:
Probability of disabling user ID = exp(Z) / (1 + exp(Z))
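As a sketch of the computation, assuming the `X` and `y` frames from the encoding step, the model can be fit and the probability evaluated with statsmodels; its `predict` method applies the same exp(Z) / (1 + exp(Z)) transform internally:

```python
import numpy as np
import statsmodels.api as sm

# Fit the logistic regression on the reviewed sample.
X_const = sm.add_constant(X)          # intercept term (the -1.611 above)
result = sm.Logit(y, X_const).fit()

# Linear predictor Z, then the logistic transform for the probability.
Z = X_const @ result.params
prob = np.exp(Z) / (1 + np.exp(Z))

# result.predict computes the same probabilities directly and can be
# applied to the remaining (unreviewed) user IDs once they are encoded.
assert np.allclose(prob, result.predict(X_const))
```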
The model is then applied to the rest of the user IDs in the organization to determine the probability of disabling each one. At this point, decision criteria are required to help the security team determine how to take action.
Accuracy of Prediction
The model has an accuracy rate of 86.2 percent at a cutoff of 40 percent, as shown in table 3 below. The model predicted 300 out of 348 user IDs accurately.
Table 3: Model accuracy
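Assuming the fitted `result` from the previous sketch, accuracy at the 40 percent cutoff can be checked as follows:

```python
# Classify each sampled ID at the 40 percent cutoff and compare the
# prediction against the owners' actual decisions.
cutoff = 0.40
predicted = (result.predict(X_const) >= cutoff).astype(int)
accuracy = (predicted == y).mean()
print(f"Accuracy at {cutoff:.0%} cutoff: {accuracy:.1%}")
```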
Variable Significance
Table 4 shows the significance of the different variables. Of the 21 variables, only four (APP4, APP7, APP10 and Test) are significant to the model.
| Variable | B | S.E. | Wald | df | Sig. | Exp(B) |
| --- | --- | --- | --- | --- | --- | --- |
| Privilege | 23.245 | 40190.4 | 0 | 1 | 1 | 1.2452E+10 |
| Shared | 23.298 | 40190.4 | 0 | 1 | 1 | 1.3126E+10 |
| OS1 | -21.767 | 40190.4 | 0 | 1 | 1 | 0 |
| OS2 | -20.092 | 40190.4 | 0 | 1 | 1 | 0 |
| OS3 | -17.621 | 40190.4 | 0 | 1 | 1 | 0 |
| OS4 | -21.089 | 40190.4 | 0 | 1 | 1 | 0 |
| OS5 | -21.027 | 40190.4 | 0 | 1 | 1 | 0 |
| APP1 | -23.029 | 3807.659 | 0 | 1 | 0.995 | 0 |
| APP2 | 1.317 | 1.337 | 0.97 | 1 | 0.325 | 3.732 |
| APP3 | 0.148 | 0.671 | 0.048 | 1 | 0.826 | 1.159 |
| APP4 | -2.742 | 1.252 | 4.797 | 1 | 0.029 | 0.064 |
| APP5 | -20.417 | 10197.04 | 0 | 1 | 0.998 | 0 |
| APP6 | -24.35 | 9738.721 | 0 | 1 | 0.998 | 0 |
| APP7 | 1.383 | 0.677 | 4.174 | 1 | 0.041 | 3.987 |
| APP8 | -20.627 | 10741.68 | 0 | 1 | 0.998 | 0 |
| APP9 | 23.187 | 10280.24 | 0 | 1 | 0.998 | 1.175E+10 |
| APP10 | -4.832 | 1.764 | 7.508 | 1 | 0.006 | 0.008 |
| APP11 | 21.445 | 13359.75 | 0 | 1 | 0.999 | 2057356196 |
| Production | -0.484 | 0.667 | 0.527 | 1 | 0.468 | 0.616 |
| Development | 0.233 | 0.892 | 0.068 | 1 | 0.794 | 1.262 |
| Test | -3.412 | 1.768 | 3.725 | 1 | 0.054 | 0.033 |
| Constant | -1.611 | 1.326 | 1.476 | 1 | 0.224 | 0.2 |

Table 4: Variable significance
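With statsmodels, the Sig. column corresponds to the p-values on the fitted result, so the significant variables can be pulled out directly; this continues the earlier hypothetical sketch:

```python
# Variables whose Wald-test p-value falls below the usual 0.05 level.
# In the run shown in table 4, these are APP4, APP7 and APP10, with
# Test just above the line at 0.054.
significant = result.pvalues[result.pvalues < 0.05]
print(significant)
```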
Further Analysis
A high number of variables suggests that many different environments exist for model building, meaning that more models must be developed to improve decision-making. In our test with 21 variables, we could have considered a number of different models.
The combination of the different variables and the selection of the variables must be reviewed each time new models are built. Depending on the environment, different models can suggest different independent variables that are significant, which impacts the accuracy of predicting the disabling of user IDs. Therefore, the model building activity is an iterative process.
Decision-Making
The models should help security teams make better decisions, which can help reduce user ID review efforts significantly. For the test described in this article, we can create the following criteria to enable decision-making:
- If the environment is not important and the accuracy of prediction is greater than 80 percent, the application owner can decide to delete the user ID and handle any exceptions if users report issues.
- If the environment is important, the application owner should delete the user ID only if the accuracy of the prediction is greater than 95 percent.
- If the environment is critical, application owners might decide to use manual methods and disregard the information provided by the models.
These three criteria help security teams determine how to take action without having to review all user IDs individually with the server and application owners.
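These criteria could be encoded as a simple rule, sketched below. It reads "accuracy of prediction" as the model's predicted probability for an individual ID, and the environment ratings are hypothetical labels that owners would assign:

```python
def decide(probability: float, environment: str) -> str:
    """Map a predicted disabling probability and an owner-assigned
    environment rating to an action, per the three criteria above.
    Rating values ("normal", "important", "critical") are illustrative."""
    if environment == "critical":
        return "manual review"  # disregard the model entirely
    if environment == "important":
        return "delete" if probability > 0.95 else "manual review"
    return "delete" if probability > 0.80 else "manual review"

print(decide(0.90, "normal"))     # delete
print(decide(0.90, "important"))  # manual review
```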
Limitations
The user IDs mentioned in this article refer to end-user IDs. A user ID can be linked to many accounts, and deleting an ID can impact other relevant access privileges for the user. This approach has risks, so careful planning and decision-making are crucial to reduce impact. The context provided here does not apply to all situations, and if you have effectively deployed user access review and management controls, such a complex approach is not required.
User Access Review Modeling Is a Means to an End
The model developed through this process is a means to an end, not an end in itself. It must be analyzed and tested, both statistically and functionally, to determine its relevance to the environment. Variable selection is always the most critical component of this process, and the model will vary from environment to environment. Still, machine learning can help any organization stretch its resources and improve decision-making during the otherwise arduous user ID cleanup process.
Senior Managing Consultant, IBM