InGef uses a wide range of current statistical methods for the analysis of routine health insurance data. The choice of methodology is based not only on practical feasibility but is also carefully aligned with the specific goals of each research project. Central to this approach is the development of high-performance analysis programs built on functions developed in-house in R, SQL, and Python, which are designed to handle the large volume of data in the research database.

InGef conducts a variety of studies based on anonymized routine data, including:

  • Cross-sectional studies
  • Case-control studies
  • Longitudinal cohort studies
  • Cost-of-illness studies

To quantify the influence and statistical significance of variables of interest on study outcomes, InGef also utilizes regression models appropriate for the type of study. These include:

  • Generalized linear models (primarily linear and logistic regression)
  • Cox regressions (for survival time analyses)
  • Generalized Estimating Equations
  • Random Effects Models (to account for clusters)
  • Two-Part Models (e.g. for cost analyses)
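
As an illustration of the two-part idea for cost data, the following minimal sketch splits the expected cost per group into the probability of incurring any cost and the mean cost among those with positive costs. It is not InGef's implementation: the function name `two_part_expected_cost`, the input layout, and the toy figures are assumptions made for this example, and the only covariate is a single grouping variable. In practice, the first part is often a logistic regression and the second a regression on the positive costs.

```python
from collections import defaultdict


def two_part_expected_cost(records):
    """Hypothetical helper: records is an iterable of (group, cost) pairs.

    Returns {group: (p_any_cost, mean_positive_cost, expected_cost)}.
    """
    costs_by_group = defaultdict(list)
    for group, cost in records:
        costs_by_group[group].append(cost)

    result = {}
    for group, costs in costs_by_group.items():
        positive = [c for c in costs if c > 0]
        # Part 1: probability of incurring any cost at all.
        p_any = len(positive) / len(costs)
        # Part 2: mean cost among persons with positive costs.
        mean_pos = sum(positive) / len(positive) if positive else 0.0
        # Combined expectation: E[cost] = P(cost > 0) * E[cost | cost > 0].
        result[group] = (p_any, mean_pos, p_any * mean_pos)
    return result


# Toy data, invented for illustration only.
example = [("A", 0), ("A", 100), ("A", 300), ("A", 0), ("B", 50), ("B", 150)]
estimates = two_part_expected_cost(example)
```

Separating the two parts keeps the many zero-cost observations from distorting the estimate of how expensive care is once it occurs.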

To present results clearly, InGef employs various visualization techniques. Examples include Kaplan-Meier curves to describe survival times, and alluvial diagrams or Sankey plots to illustrate the progression of treatments.
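
The values behind a Kaplan-Meier curve can be computed with the product-limit estimator. The sketch below is a minimal standard-library version for illustration only; the function name `kaplan_meier` and the toy follow-up data are assumptions, not part of InGef's tooling. It returns the estimated survival probability at each distinct event time, which a plotting library could then draw as a step function.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Hypothetical helper implementing the product-limit estimator.

    durations: follow-up time per person
    events:    1 if the event was observed, 0 if the person was censored
    Returns a list of (time, survival_probability) at each event time.
    """
    observed = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # persons leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = observed.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Events and censored persons both leave the risk set after time t.
        at_risk -= exits[t]
    return curve


# Toy data, invented for illustration only.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Censored persons contribute to the risk set up to their last observed time but never trigger a drop in the curve, which is what distinguishes this estimator from a simple proportion of survivors.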

For selected questions, modern machine learning methods are applied in addition to classical statistical methods. In several research projects, for instance, deep learning and natural language processing methods have been employed to predict healthcare costs.

A particular focus of InGef lies in ensuring the comparability of patient populations within observational studies. In addition to standardization and exact matching of populations on specific characteristics (age, gender), propensity score matching (PSM) is often used. In a research project of the German Federal Ministry for Economic Affairs and Climate Action, InGef investigated how the complete database can be used to enhance PSM with the help of machine learning methods. Alternative propensity score applications such as inverse probability of treatment weighting and stratification have also been successfully implemented in previous research projects.
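
To make the two propensity score applications more concrete, the following sketch shows greedy 1:1 nearest-neighbor matching within a caliper and the weights used in inverse probability of treatment weighting. It assumes the propensity scores have already been estimated (for example with a logistic regression); the function names `greedy_match` and `iptw_weight`, the caliper value, and the toy scores are hypothetical and do not describe InGef's actual procedure.

```python
def greedy_match(treated, controls, caliper=0.05):
    """Hypothetical helper: greedy 1:1 matching without replacement.

    treated, controls: dicts mapping person id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)
    pairs = []
    # Process treated persons in order of increasing score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


def iptw_weight(is_treated, score):
    """Weight for estimating the average treatment effect via IPTW."""
    return 1 / score if is_treated else 1 / (1 - score)


# Toy scores, invented for illustration only.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.90}
controls = {"c1": 0.28, "c2": 0.33, "c3": 0.60, "c4": 0.10}
pairs = greedy_match(treated, controls)
```

In this toy example, `t3` remains unmatched because no control lies within the caliper. Matching discards such persons, whereas weighting keeps everyone in the analysis but can produce large weights for extreme scores.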


Health services research projects and pharmacoepidemiological questions can be addressed using the Research Database (RDB), which contains anonymized billing data from approximately 8.8 million persons with statutory health insurance across 52 health insurance funds (see also the publication by Ludwig, Enders, Basedow, Walker, and Jacob). In addition to sociodemographic information, InGef’s RDB contains longitudinally linked data on drug prescriptions and on outpatient and inpatient treatments, covering six consecutive calendar years.

We are committed to following national and international recommendations for conducting research and publishing its findings. Among others, we work in accordance with: