Data sources

Below are the data sources currently used in our services. A data source can be the original location where data is generated or where information is first digitized. It is a source of information as long as a process accesses and uses it. A data source is a facility that can provide users with information through a known protocol or format. The data source can be closed, open, structured, unstructured, curated, or uncurated. It can be a database, a flat file, live measurements from devices, or streaming data services, for example.

The bibliometric database

The bibliometric database at KTH, Bibmet, contains publication and citation data from Web of Science, the publication database DiVA, KTH publications from Scopus, and Unpaywall. In Bibmet, data are processed and combined, normalized citation indicators are calculated both including and excluding self-citations, and addresses are reviewed and unified. This processing makes advanced, high-quality bibliometric analyses possible. Bibmet also includes a classification of publications, created by clustering them on citation links.
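As an illustration of what a normalized citation indicator calculated with and without self-citations can look like, here is a minimal Python sketch. It assumes a simple field-normalized score: each publication's citation count divided by the mean count for publications in the same field and year. The records, field names, and the `normalized_scores` function are hypothetical, and the source does not state that Bibmet calculates its indicators this way.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: field, year, total citations, and self-citations.
publications = [
    {"id": "p1", "field": "physics", "year": 2020, "cites": 12, "self_cites": 2},
    {"id": "p2", "field": "physics", "year": 2020, "cites": 4, "self_cites": 0},
    {"id": "p3", "field": "physics", "year": 2020, "cites": 8, "self_cites": 4},
    {"id": "p4", "field": "chemistry", "year": 2020, "cites": 3, "self_cites": 1},
    {"id": "p5", "field": "chemistry", "year": 2020, "cites": 9, "self_cites": 1},
]


def normalized_scores(pubs, exclude_self=False):
    """Divide each citation count by the mean of its field/year reference set."""

    def count(p):
        return p["cites"] - p["self_cites"] if exclude_self else p["cites"]

    # Collect the citation counts of every field/year reference set.
    groups = defaultdict(list)
    for p in pubs:
        groups[(p["field"], p["year"])].append(count(p))

    scores = {}
    for p in pubs:
        baseline = mean(groups[(p["field"], p["year"])])
        # A reference set without any citations gives no meaningful ratio.
        scores[p["id"]] = count(p) / baseline if baseline else None
    return scores


with_self = normalized_scores(publications)
without_self = normalized_scores(publications, exclude_self=True)
```

In this toy data, publication p3 scores 1.0 when self-citations are counted but drops below 1.0 when they are excluded, which is why it can be useful to report both variants.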

Bibmet is used internally by the bibliometric group at the library, and is the database that forms the basis of the bibliometric analyses that can be obtained in the Annual Bibliometric Monitoring (ABM) system.

DiVA

DiVA is KTH's publication database. It contains publications produced by the university's researchers and students. Some metadata from DiVA is also relevant to Bibmet and is therefore exported there. KTH's metadata from DiVA is also delivered to the national publication database Swepub.

Web of Science

Web of Science is a publisher-independent reference database with publication and citation data that supports reliable discovery, access, and evaluation of research. It is multidisciplinary and covers journals in medicine, science, the social sciences, and the humanities. The database is used to support decision-making at KTH and at other institutions.

Scopus

Scopus is a large database covering natural science, medicine, technology, social sciences, and humanities. The content in Scopus and Web of Science overlaps to a large extent but not completely. A bibliometric analysis will therefore not give exactly the same results in both databases.
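The partial overlap can be made concrete by comparing the identifiers each database holds for the same set of publications. The sketch below assumes that records are matched on DOI, which is only one possible matching key; the DOIs and the `normalize_doi` helper are hypothetical.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


# Hypothetical DOI lists exported from each database.
wos_dois = ["10.1000/A1", "10.1000/b2", "https://doi.org/10.1000/c3"]
scopus_dois = ["10.1000/a1", "doi:10.1000/C3", "10.1000/d4"]

wos = {normalize_doi(d) for d in wos_dois}
scopus = {normalize_doi(d) for d in scopus_dois}

in_both = wos & scopus        # publications indexed in both databases
only_wos = wos - scopus       # indexed in Web of Science only
only_scopus = scopus - wos    # indexed in Scopus only

# Share of all distinct publications that appear in both databases.
overlap_share = len(in_both) / len(wos | scopus)
```

Publications that fall into `only_wos` or `only_scopus` contribute to an analysis in one database but not the other, which is one reason the results differ.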

HR data

The data source from HR contains data about the employees at KTH. This includes information about organizational affiliation, title, and length of employment.

UG

KTH's user database UG contains information about researchers and other users at KTH. Some of this information is available through KTH's Web APIs, but through UG it is also possible to do batch lookups when there are specific needs for information that cannot be obtained through KTH's Web APIs.

KTH's Web APIs

From the KTH Profiles API, public data about people at KTH can be extracted. From the KTH Directory API, information about which researchers belong to a particular organizational unit can be obtained.

R packages that enable access to data sources

A number of R packages have been developed that enable access to data from various sources. These include packages for retrieving data from KTH's internal Web APIs, from DiVA publications, from external closed data sources such as Scopus and Web of Science, and from external open sources such as OpenAlex, CORDIS, SweCRIS, and more.

Learn more about R packages

Data flow (Data Pipeline)

Both internal and external data are combined in several of the services. Relevant data is collected from various sources in an intermediate layer, so-called object storage, where "buckets" are used to group datasets that belong together. Access to this layer is provided via the S3 protocol. An open-source component called MinIO is used for this storage. Data is stored in widely supported formats, such as CSV and Parquet, which can be used in many contexts and minimize lock-in to specific products.
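A minimal Python sketch of reading a CSV dataset from a bucket is shown below. It assumes a client object with a boto3-style `get_object(Bucket=..., Key=...)` method. The bucket name, object key, and the `InMemoryClient` stand-in are hypothetical and exist only so that the example runs without a storage server.

```python
import csv
import io


def read_csv_object(client, bucket, key):
    """Fetch one CSV object over S3 and return its rows as dictionaries.

    `client` is any object with a boto3-style get_object(Bucket=..., Key=...)
    method whose response holds a readable "Body".
    """
    response = client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


class InMemoryClient:
    """Stand-in for an S3 client so the sketch runs without a server."""

    def __init__(self, objects):
        self._objects = objects  # maps (bucket, key) to the object's bytes

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


# Hypothetical bucket and key names.
client = InMemoryClient({
    ("bibliometrics", "publications.csv"): b"id,year\np1,2020\np2,2021\n",
})
rows = read_csv_object(client, "bibliometrics", "publications.csv")
```

Against a real deployment, the stand-in could be replaced by a client created with `boto3.client("s3", endpoint_url=...)`, where the endpoint and credentials depend on the installation. Because the function only relies on the S3 protocol, it does not depend on which product implements the storage.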

Based on this collected data, processing is then carried out for various purposes, in accordance with the needs and requirements of different services and applications. In the processing stage, either more traditional database technologies or tools from the "data science" toolbox, such as R and Python, are used. Another open-source component used in the processing stage is the in-process OLAP database DuckDB, which supports newer formats such as Parquet and Arrow and can also quickly query data that resides in other databases (such as SQLite and Postgres).
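A processing step of the database kind can be sketched as a SQL aggregation over collected records. To stay runnable without extra packages, the example below uses Python's standard-library `sqlite3` module rather than DuckDB. The table, column names, and values are hypothetical.

```python
import sqlite3

# An in-memory database stands in for data loaded from the object storage.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE publications (id TEXT, unit TEXT, year INTEGER, cites INTEGER)"
)
con.executemany(
    "INSERT INTO publications VALUES (?, ?, ?, ?)",
    [
        ("p1", "unit_a", 2020, 12),
        ("p2", "unit_a", 2021, 4),
        ("p3", "unit_b", 2020, 8),
    ],
)

# Aggregate per organizational unit: publication count and total citations.
summary = con.execute(
    """
    SELECT unit, COUNT(*) AS n_pubs, SUM(cites) AS total_cites
    FROM publications
    GROUP BY unit
    ORDER BY unit
    """
).fetchall()
con.close()
```

The resulting `summary` is a list of tuples, one per unit, that a presentation layer could render as a table or chart.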

In the presentation stage, interactive web applications or interactive reports are often used.