bupaR Docs | Create Logs (2024)

Transforming your raw data into an event log object is one of the most challenging tasks in process analysis. On this page, we cover the situations and challenges that you can encounter.

Logs: `eventlog` vs `activitylog`

bupaR supports two different kinds of log formats, both of which are extensions of the R data.frame:

eventlog: Event logs are created from a data.frame in which each row represents a single event. This means that each row has a single timestamp.
activitylog: Activity logs are created from a data.frame in which each row represents a single activity instance. This means that each row can have multiple timestamps, stored in different columns.

The data model below shows the difference between these two levels of observation, i.e. activity instances vs. events.

The example below shows an excerpt of an event log containing 6 events. Each event is linked to a single timestamp. As there can be several events within a single activity instance, each event also needs to be linked to a lifecycle status (here the registration_type). Furthermore, an activity instance identifier (handling_id) is needed to indicate which events belong to the same activity instance.

handling	patient	employee	handling_id	registration_type	time
Registration	333	r1	333	start	2017-11-15 16:50:59
Registration	333	r1	333	complete	2017-11-15 18:45:18
Triage and Assessment	333	r2	833	start	2017-11-16 20:37:26
Triage and Assessment	333	r2	833	complete	2017-11-17 08:21:08
Blood test	333	r3	1152	start	2017-11-17 22:27:09
Blood test	333	r3	1152	complete	2017-11-18 03:16:03

Transactional lifecycle?

An event is an atomic registration related to an activity instance. It thus contains one (and only one) timestamp. Additionally, the event should include a reference to a lifecycle transition. More specifically, multiple events can describe different lifecycle transitions of a single activity instance. For example, one event might record when a surgery is scheduled, another when it is started, yet another when it is completed, etc.

The standard transactional lifecycle.

handling	patient	employee	handling_id	complete	start
Registration	333	r1	333	2017-11-15 18:45:18	2017-11-15 16:50:59
Triage and Assessment	333	r2	833	2017-11-17 08:21:08	2017-11-16 20:37:26
Blood test	333	r3	1152	2017-11-18 03:16:03	2017-11-17 22:27:09
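The two tables above hold the same information at different levels of observation. As a minimal base R sketch of that relationship (an illustration only, not how bupaR converts logs internally), the event-level rows can be paired into activity instances by matching start and complete events on the activity instance identifier. The UTC time zone is an assumption.

```r
# Event-level data: one row per event (values copied from the excerpt above).
events <- data.frame(
  handling = rep(c("Registration", "Triage and Assessment", "Blood test"), each = 2),
  patient = "333",
  employee = rep(c("r1", "r2", "r3"), each = 2),
  handling_id = rep(c("333", "833", "1152"), each = 2),
  registration_type = rep(c("start", "complete"), times = 3),
  time = as.POSIXct(c("2017-11-15 16:50:59", "2017-11-15 18:45:18",
                      "2017-11-16 20:37:26", "2017-11-17 08:21:08",
                      "2017-11-17 22:27:09", "2017-11-18 03:16:03"),
                    tz = "UTC"),  # time zone is an assumption
  stringsAsFactors = FALSE
)

# Columns that describe the activity instance rather than a single event.
id_cols <- c("handling", "patient", "employee", "handling_id")

# Split the events by lifecycle status and name the timestamp after it.
starts <- events[events$registration_type == "start", c(id_cols, "time")]
completes <- events[events$registration_type == "complete", c(id_cols, "time")]
names(starts)[names(starts) == "time"] <- "start"
names(completes)[names(completes) == "time"] <- "complete"

# One row per activity instance, with one timestamp column per lifecycle status.
activities <- merge(completes, starts, by = id_cols)
activities <- activities[order(activities$start), ]
rownames(activities) <- NULL
print(activities)
```

Note that this pairing only works because the employee is the same for both events of each activity instance; otherwise the merge on id_cols would not match the rows.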

The right log for the job

Functionalities in the bupaR core packages support both formats. As such, the goal of your analysis does not affect the decision; only the complexity of your data matters. The precise format of your raw data further determines the preparatory steps that are needed. We can distinguish between 3 typical scenarios. The flowchart below helps you on your way.

An activitylog is the best option when each row in your data is an activity instance, or when events belonging to the same activity instance have equal attribute values (e.g. all events are executed by the same resource). When neither of these criteria holds, you need an eventlog object.
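When rows are events, the second criterion can be checked before choosing a format. The sketch below uses base R only; constant_within_instance() is a hypothetical helper written for this illustration and is not part of bupaR.

```r
# Hypothetical helper (not a bupaR function): returns TRUE when every listed
# attribute takes exactly one value within each activity instance.
constant_within_instance <- function(df, instance_col, attr_cols) {
  all(vapply(attr_cols, function(col) {
    values_per_instance <- split(df[[col]], df[[instance_col]])
    n_values <- vapply(values_per_instance,
                       function(x) length(unique(x)), integer(1))
    all(n_values == 1L)
  }, logical(1)))
}

# Same employee for both events of each instance: an activitylog would fit.
same_resource <- data.frame(
  handling_id = c("227", "227", "727", "727"),
  employee = c("r1", "r1", "r2", "r2"),
  stringsAsFactors = FALSE
)

# Different employees within an instance: an eventlog is needed.
mixed_resource <- data.frame(
  handling_id = c("116", "116", "616", "616"),
  employee = c("r2", "r6", "r1", "r7"),
  stringsAsFactors = FALSE
)

constant_within_instance(same_resource, "handling_id", "employee")   # TRUE
constant_within_instance(mixed_resource, "handling_id", "employee")  # FALSE
```

Pass every additional attribute column in attr_cols, since a single differing attribute is enough to require an eventlog.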

Scenario 1

If each row in your data.frame is already an activity instance, the activitylog format is the best way to go. Consider the data sample below.

patient	handling	activity_started	activity_ended
464	Blood test	2018-04-06 20:04:09	2018-04-07 01:18:17
464	Check-out	2018-04-12 19:02:11	2018-04-12 21:41:01
464	Discuss Results	2018-04-12 11:00:16	2018-04-12 13:59:44
464	MRI SCAN	2018-04-07 06:30:56	2018-04-07 09:37:26
464	Registration	2018-03-20 19:07:17	2018-03-20 21:15:41
464	Triage and Assessment	2018-03-21 15:58:55	2018-03-22 05:21:56

As each row contains multiple timestamps, i.e. activity_started and activity_ended, it is clear that each row represents an activity instance. Turning this dataset into an activitylog requires the following steps:

Timestamp variables should be named in correspondence with the standard Transactional lifecycle.
Timestamp variables should be of type Date or POSIXct.
Use the activitylog constructor function.

data %>%
    # rename timestamp variables appropriately
    dplyr::rename(start = activity_started,
                  complete = activity_ended) %>%
    # convert timestamps
    convert_timestamps(columns = c("start", "complete"), format = ymd_hms) %>%
    activitylog(case_id = "patient",
                activity_id = "handling",
                timestamps = c("start", "complete"))

## # Log of 12 events consisting of:
## 1 trace
## 1 case
## 6 instances of 6 activities
## 0 resources
## Events occurred from 2018-03-20 19:07:17 until 2018-04-12 21:41:01
##
## # Variables were mapped as follows:
## Case identifier:     patient
## Activity identifier: handling
## Resource identifier: employee
## Timestamps:          start, complete
##
## # A tibble: 6 × 5
##   patient handling              start               complete            .order
##   <chr>   <fct>                 <dttm>              <dttm>               <int>
## 1 464     Blood test            2018-04-06 20:04:09 2018-04-07 01:18:17      1
## 2 464     Check-out             2018-04-12 19:02:11 2018-04-12 21:41:01      2
## 3 464     Discuss Results       2018-04-12 11:00:16 2018-04-12 13:59:44      3
## 4 464     MRI SCAN              2018-04-07 06:30:56 2018-04-07 09:37:26      4
## 5 464     Registration          2018-03-20 19:07:17 2018-03-20 21:15:41      5
## 6 464     Triage and Assessment 2018-03-21 15:58:55 2018-03-22 05:21:56      6

Note that if a resource identifier is available, this information can be added in the activitylog call.
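The renaming and timestamp steps listed above can also be carried out in base R before calling the constructor. The following is a sketch under assumptions: the timestamps are stored as character strings in the format shown in the data sample, and the UTC time zone is chosen only for illustration.

```r
# Two rows from the data sample above, with timestamps still stored as text.
data <- data.frame(
  patient = "464",
  handling = c("Registration", "Triage and Assessment"),
  activity_started = c("2018-03-20 19:07:17", "2018-03-21 15:58:55"),
  activity_ended = c("2018-03-20 21:15:41", "2018-03-22 05:21:56"),
  stringsAsFactors = FALSE
)

# Step 1: name the timestamp columns after the lifecycle statuses.
names(data)[names(data) == "activity_started"] <- "start"
names(data)[names(data) == "activity_ended"] <- "complete"

# Step 2: convert the text columns to POSIXct (time zone is an assumption).
for (col in c("start", "complete")) {
  data[[col]] <- as.POSIXct(data[[col]], format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

str(data[c("start", "complete")])
```

A quick sanity check after conversion is to confirm that no timestamp became NA and that every complete timestamp lies after its start timestamp.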

Scenario 2

If each row in your data.frame is an event, but all events that belong to the same activity instance share the same attribute values, the activitylog format is again the best way to go. Consider the data sample below.

patient	handling	employee	handling_id	registration_type	time
227	Registration	r1	227	started	2017-08-09 19:55:30
227	Triage and Assessment	r2	727	started	2017-08-09 22:17:43
227	Registration	r1	227	completed	2017-08-09 22:17:43
227	Triage and Assessment	r2	727	completed	2017-08-10 15:21:30
227	Blood test	r3	1109	started	2017-08-17 03:01:24
227	Blood test	r3	1109	completed	2017-08-17 09:17:20
227	MRI SCAN	r4	1346	started	2017-08-17 13:15:04
227	MRI SCAN	r4	1346	completed	2017-08-17 18:47:44
227	Discuss Results	r6	1961	started	2017-08-22 13:33:38
227	Check-out	r7	2456	started	2017-08-22 15:38:38
227	Discuss Results	r6	1961	completed	2017-08-22 15:38:38
227	Check-out	r7	2456	completed	2017-08-22 17:12:46

The resource identifier (employee) has been added as an additional attribute. Note that although each row is an event, the rows can be grouped into activity instances using the handling_id column, which we will call the activity instance id. Using the latter, we can see that the resource attribute is the same within each activity instance, which allows us to create an activitylog. The steps to do so are the following.

Lifecycle variable should be named in correspondence with the standard Transactional lifecycle.
Timestamp variable should be of type Date or POSIXct.
Use the eventlog constructor function.
Convert to activitylog using to_activitylog for reduced memory usage and improved performance.

data %>%
    # recode lifecycle variable appropriately
    dplyr::mutate(registration_type = forcats::fct_recode(registration_type,
                                                          "start" = "started",
                                                          "complete" = "completed")) %>%
    convert_timestamps(columns = "time", format = ymd_hms) %>%
    eventlog(case_id = "patient",
             activity_id = "handling",
             activity_instance_id = "handling_id",
             lifecycle_id = "registration_type",
             timestamp = "time",
             resource_id = "employee") %>%
    to_activitylog() -> tmp_act

Note that the resource identifier is optional, and can be left out of the eventlog call if such an attribute does not exist in your data. If the activity instance id does not exist, some heuristics are available to generate it: see Missing activity instance id.

Scenario 3

If each row is an event, and events of the same activity instance have differing attribute values, the flexibility of eventlog objects is required. Consider the data sample below.

patient	handling	employee	handling_id	registration_type	time
116	Registration	r2	116	started	2017-04-29 03:24:59
116	Registration	r6	116	completed	2017-04-29 06:23:09
116	Triage and Assessment	r1	616	started	2017-04-29 15:41:27
116	Triage and Assessment	r7	616	completed	2017-04-30 03:04:21
116	Blood test	r4	1054	started	2017-04-30 15:13:28
116	Blood test	r6	1054	completed	2017-04-30 21:24:18
116	MRI SCAN	r1	1291	started	2017-05-01 01:12:51
116	MRI SCAN	r4	1291	completed	2017-05-01 05:32:37
116	Discuss Results	r3	1850	started	2017-05-01 09:44:20
116	Discuss Results	r7	1850	completed	2017-05-01 14:00:48
116	Check-out	r3	2345	started	2017-05-03 04:02:35
116	Check-out	r2	2345	completed	2017-05-03 06:16:03

In this example, different resources (employees) sometimes perform the start and complete event of the same activity instance. Therefore, we resort to the eventlog format, which has no problem storing this. The steps to take are the following:

Lifecycle variable should be named in correspondence with the standard Transactional lifecycle.
Timestamp variable should be of type Date or POSIXct.
Use the eventlog constructor function.

data %>%
    # recode lifecycle variable appropriately
    dplyr::mutate(registration_type = forcats::fct_recode(registration_type,
                                                          "start" = "started",
                                                          "complete" = "completed")) %>%
    convert_timestamps(columns = "time", format = ymd_hms) %>%
    eventlog(case_id = "patient",
             activity_id = "handling",
             activity_instance_id = "handling_id",
             lifecycle_id = "registration_type",
             timestamp = "time",
             resource_id = "employee")

## Warning in validate_eventlog(eventlog): The following activity instances are
## connected to more than one resource: 1054,116,1291,1850,2345,616

## # Log of 12 events consisting of:
## 1 trace
## 1 case
## 6 instances of 6 activities
## 6 resources
## Events occurred from 2017-04-29 03:24:59 until 2017-05-03 06:16:03
##
## # Variables were mapped as follows:
## Case identifier:              patient
## Activity identifier:          handling
## Resource identifier:          employee
## Activity instance identifier: handling_id
## Timestamp:                    time
## Lifecycle transition:         registration_type
##
## # A tibble: 12 × 7
##    patient handling   employee handling_id registration_type time
##    <chr>   <fct>      <fct>    <chr>       <fct>             <dttm>
##  1 116     Registrat… r2       116         start             2017-04-29 03:24:59
##  2 116     Registrat… r6       116         complete          2017-04-29 06:23:09
##  3 116     Triage an… r1       616         start             2017-04-29 15:41:27
##  4 116     Triage an… r7       616         complete          2017-04-30 03:04:21
##  5 116     Blood test r4       1054        start             2017-04-30 15:13:28
##  6 116     Blood test r6       1054        complete          2017-04-30 21:24:18
##  7 116     MRI SCAN   r1       1291        start             2017-05-01 01:12:51
##  8 116     MRI SCAN   r4       1291        complete          2017-05-01 05:32:37
##  9 116     Discuss R… r3       1850        start             2017-05-01 09:44:20
## 10 116     Discuss R… r7       1850        complete          2017-05-01 14:00:48
## 11 116     Check-out  r3       2345        start             2017-05-03 04:02:35
## 12 116     Check-out  r2       2345        complete          2017-05-03 06:16:03
## # ℹ 1 more variable: .order <int>

Note that we need an eventlog irrespective of which attribute values differ, i.e. it can be resources, but also any additional variables you have in your data set. For the special case of resource values, a different resource executing events in the same activity instance might be a data quality issue. If so, some functions can help you to identify this issue: Inconsistent Resources.

Again, if the activity instance id does not exist, some heuristics are available to generate it: see Missing activity instance id.

Typical problems

Missing activity instance id

In order to correlate events that belong to the same activity instance, an activity instance identifier is required. For example, in the data shown below, a patient may have gone through several surgeries, each with its own start and complete event. The activity instance identifier then makes it possible to distinguish which events belong together and which do not. It is important to note that this instance identifier should be unique, also across different cases and activities.

patient	activity	timestamp	status	activity_instance
John Doe	check-in	2017-05-10 08:33:26	complete	1
John Doe	surgery	2017-05-10 08:53:16	start	2
John Doe	surgery	2017-05-10 09:25:19	complete	2
John Doe	treatment	2017-05-10 10:01:25	start	3
John Doe	treatment	2017-05-10 10:35:18	complete	3
John Doe	surgery	2017-05-10 10:41:35	start	4
John Doe	surgery	2017-05-10 11:05:56	complete	4
John Doe	check-out	2017-05-11 14:52:36	complete	5

If the activity instance identifier is not available, you can use the assign_instance_id() function, which uses a heuristic to create the missing identifier. Alternatively, you can try to create the identifier yourself using dplyr::mutate() and other manipulation functions.
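As an illustration of building the identifier by hand, the base R sketch below pairs the k-th start event with the k-th complete event of the same activity within a case. This is one simple heuristic chosen for this example, not necessarily the one used by assign_instance_id(), and it assumes that repeated instances of the same activity within a case do not overlap in time. The UTC time zone is also an assumption.

```r
# The example data from above, without the activity_instance column.
events <- data.frame(
  patient = "John Doe",
  activity = c("check-in", "surgery", "surgery", "treatment",
               "treatment", "surgery", "surgery", "check-out"),
  timestamp = as.POSIXct(c("2017-05-10 08:33:26", "2017-05-10 08:53:16",
                           "2017-05-10 09:25:19", "2017-05-10 10:01:25",
                           "2017-05-10 10:35:18", "2017-05-10 10:41:35",
                           "2017-05-10 11:05:56", "2017-05-11 14:52:36"),
                         tz = "UTC"),  # time zone is an assumption
  status = c("complete", "start", "complete", "start",
             "complete", "start", "complete", "complete"),
  stringsAsFactors = FALSE
)

# The heuristic relies on chronological order.
events <- events[order(events$timestamp), ]

# Number each (case, activity, status) combination in order of occurrence:
# the 2nd "start" of surgery and the 2nd "complete" of surgery both get 2.
occurrence <- ave(seq_len(nrow(events)),
                  events$patient, events$activity, events$status,
                  FUN = seq_along)

# Events with the same case, activity and occurrence number belong together.
key <- paste(events$patient, events$activity, occurrence, sep = "|")

# Turn the keys into integer ids that are unique across cases and activities.
events$activity_instance <- match(key, unique(key))

events$activity_instance
# 1 2 2 3 3 4 4 5
```

The resulting ids match the activity_instance column of the table above. If activity instances of the same type can overlap within a case, this pairing may link the wrong events, so the result should be inspected before use.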

Large Datasets and Validation

By default, bupaR validates certain properties of the activity instances that are supplied when creating an event log:

a single activity instance identifier must not be connected to multiple cases,
a single activity instance identifier must not be connected to multiple activity labels.

However, these checks are not efficient and may lead to considerable performance issues for large data frames. If you already know that your data fulfills all the requirements, you can deactivate the validation with the argument validate = FALSE when creating the eventlog. Note that when the activity instance id was created with the assign_instance_id() function, you can assume the above properties hold.
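Before switching validation off, the two properties listed above can be verified once on the raw data. The sketch below is a base R approximation of those checks, not bupaR's own validation code; ids_linked_to_multiple() is a hypothetical helper and the small data frame is made up for illustration.

```r
# Hypothetical helper (not a bupaR function): returns the activity instance
# ids that are connected to more than one distinct value of other_col.
ids_linked_to_multiple <- function(df, instance_col, other_col) {
  values_per_id <- split(df[[other_col]], df[[instance_col]])
  n_values <- vapply(values_per_id,
                     function(x) length(unique(x)), integer(1))
  names(n_values)[n_values > 1L]
}

# Illustrative data: instance "2" is wrongly shared by two cases and
# two activity labels.
events <- data.frame(
  patient = c("A", "A", "B", "C"),
  activity = c("surgery", "surgery", "surgery", "treatment"),
  activity_instance = c("1", "1", "2", "2"),
  stringsAsFactors = FALSE
)

ids_linked_to_multiple(events, "activity_instance", "patient")   # "2"
ids_linked_to_multiple(events, "activity_instance", "activity")  # "2"
```

When both calls return an empty character vector, the data satisfies the two properties.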

Inconsistent Resources

Each event can be associated with a resource. Different events belonging to the same activity instance may be executed by different resources, as in the eventlog below.

patient	handling	employee	handling_id	registration_type	time	.order
206	Registration	r4	206	start	2017-07-19 15:48:14	1
206	Triage and Assessment	r6	706	start	2017-07-19 17:03:44	2
206	Registration	r3	206	complete	2017-07-19 17:03:44	3
206	Triage and Assessment	r7	706	complete	2017-07-20 07:28:53	4
206	Blood test	r1	1100	start	2017-07-25 03:02:14	5
206	Blood test	r3	1100	complete	2017-07-25 08:14:46	6
206	MRI SCAN	r6	1337	start	2017-07-25 12:37:36	7
206	MRI SCAN	r2	1337	complete	2017-07-25 16:52:16	8
206	Discuss Results	r2	1940	start	2017-07-26 07:36:36	9
206	Discuss Results	r4	1940	complete	2017-07-26 11:08:03	10
206	Check-out	r1	2435	start	2017-07-28 02:54:17	11
206	Check-out	r7	2435	complete	2017-07-28 03:55:13	12

If you have a large dataset, and want to have an overview of the activity instances that have more than one resource connected to them, you can use the detect_resource_inconsistencies() function.

log %>% detect_resource_inconsistencies()

## # A tibble: 6 × 5
##   patient handling              handling_id complete start
##   <chr>   <fct>                 <chr>       <chr>    <chr>
## 1 206     Blood test            1100        r3       r1
## 2 206     Check-out             2435        r7       r1
## 3 206     Discuss Results       1940        r4       r2
## 4 206     MRI SCAN              1337        r2       r6
## 5 206     Registration          206         r3       r4
## 6 206     Triage and Assessment 706         r7       r6
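To make clear what counts as an inconsistency here, the same idea can be sketched in base R on a plain data frame: an activity instance is flagged when its events list more than one distinct resource. This only illustrates the concept and is not the implementation of detect_resource_inconsistencies(). The last two rows (handling_id 9999) are hypothetical and were added to show a consistent instance.

```r
# First four events from the log above, plus one hypothetical consistent
# activity instance (handling_id "9999").
events <- data.frame(
  handling = c("Registration", "Triage and Assessment", "Registration",
               "Triage and Assessment", "Blood test", "Blood test"),
  employee = c("r4", "r6", "r3", "r7", "r1", "r1"),
  handling_id = c("206", "706", "206", "706", "9999", "9999"),
  registration_type = c("start", "start", "complete",
                        "complete", "start", "complete"),
  stringsAsFactors = FALSE
)

# Distinct resources per activity instance.
resources_per_instance <- lapply(split(events$employee, events$handling_id),
                                 unique)

# Keep the instances that are connected to more than one resource.
inconsistent <- resources_per_instance[lengths(resources_per_instance) > 1]

names(inconsistent)
# "206" "706"
```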

If you want to remove these inconsistencies, a quick fix is to merge the resource labels together with fix_resource_inconsistencies(). Note that this is not needed for an eventlog, but it is for an activitylog. While the creation of the eventlog will emit a warning when resource inconsistencies exist, this should mostly be seen as a data quality warning. That said, analyses that count resources may give odd results when such inconsistencies are present.

log %>% fix_resource_inconsistencies()

## *** OUTPUT ***

## A total of 6 activity executions in the event log are classified as inconsistencies.

## They are spread over the following cases and activities:

## # A tibble: 6 × 5
##   patient handling              handling_id complete start
##   <chr>   <fct>                 <chr>       <chr>    <chr>
## 1 206     Blood test            1100        r3       r1
## 2 206     Check-out             2435        r7       r1
## 3 206     Discuss Results       1940        r4       r2
## 4 206     MRI SCAN              1337        r2       r6
## 5 206     Registration          206         r3       r4
## 6 206     Triage and Assessment 706         r7       r6

## Inconsistencies solved succesfully.

## # Log of 12 events consisting of:
## 1 trace
## 1 case
## 6 instances of 6 activities
## 6 resources
## Events occurred from 2017-07-19 15:48:14 until 2017-07-28 03:55:13
##
## # Variables were mapped as follows:
## Case identifier:              patient
## Activity identifier:          handling
## Resource identifier:          employee
## Activity instance identifier: handling_id
## Timestamp:                    time
## Lifecycle transition:         registration_type
##
## # A tibble: 12 × 7
##    patient handling   employee handling_id registration_type time
##    <chr>   <fct>      <chr>    <chr>       <fct>             <dttm>
##  1 206     Registrat… r3 - r4  206         start             2017-07-19 15:48:14
##  2 206     Triage an… r7 - r6  706         start             2017-07-19 17:03:44
##  3 206     Registrat… r3 - r4  206         complete          2017-07-19 17:03:44
##  4 206     Triage an… r7 - r6  706         complete          2017-07-20 07:28:53
##  5 206     Blood test r3 - r1  1100        start             2017-07-25 03:02:14
##  6 206     Blood test r3 - r1  1100        complete          2017-07-25 08:14:46
##  7 206     MRI SCAN   r2 - r6  1337        start             2017-07-25 12:37:36
##  8 206     MRI SCAN   r2 - r6  1337        complete          2017-07-25 16:52:16
##  9 206     Discuss R… r4 - r2  1940        start             2017-07-26 07:36:36
## 10 206     Discuss R… r4 - r2  1940        complete          2017-07-26 11:08:03
## 11 206     Check-out  r7 - r1  2435        start             2017-07-28 02:54:17
## 12 206     Check-out  r7 - r1  2435        complete          2017-07-28 03:55:13
## # ℹ 1 more variable: .order <int>
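The effect of merging resource labels can be reproduced on a plain data frame with base R. The sketch below is an approximation for illustration, not the code behind fix_resource_inconsistencies(): it sorts the labels alphabetically before joining them, so the order inside a merged label can differ from the output shown above. merge_labels() is a hypothetical helper.

```r
# First four events from the log above.
events <- data.frame(
  handling = c("Registration", "Triage and Assessment",
               "Registration", "Triage and Assessment"),
  employee = c("r4", "r6", "r3", "r7"),
  handling_id = c("206", "706", "206", "706"),
  registration_type = c("start", "start", "complete", "complete"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one combined label per activity instance,
# repeated for every event of that instance.
merge_labels <- function(x) {
  rep(paste(sort(unique(x)), collapse = " - "), length(x))
}

# Replace each event's resource by the merged label of its instance.
events$employee <- ave(events$employee, events$handling_id, FUN = merge_labels)

events$employee
# "r3 - r4" "r6 - r7" "r3 - r4" "r6 - r7"
```

After this step every activity instance is connected to exactly one (merged) resource label, which is the condition an activitylog requires.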
