Can Microsoft GitHub Copilot Enhance the Productivity of Medical Software Development?

And Accelerate Overall Solution Development Cycles?

A Focused Internal Study on Microsoft GitHub Copilot

Decreasing time-to-market for digital MedTech solutions is crucial in today's fast-paced healthcare environment. Rapid advancements in technology and evolving patient needs demand timely innovations that can improve patient outcomes and streamline healthcare processes. By accelerating the development and deployment of these solutions, companies can gain a competitive edge, respond to market demands more effectively, and ultimately enhance the quality of care. Faster time-to-market not only benefits businesses but also ensures that patients receive timely access to cutting-edge medical technologies that can significantly impact their health and well-being.

In that context, we conducted both secondary and primary research. This internal study at ZEISS Digital Innovation focused on how GenAI could support the acceleration of medical software development.

Exploring Secondary Research: The Impact of Generative AI on Medical Software Development and Quality Outcomes

The integration of generative AI tools into the development lifecycle of medical software promises significant improvements in both development efficiency and quality[1,2]. These tools can be used, for example, to automate code and test generation, improve existing code, and provide intelligent suggestions and solutions across the entire software development life cycle. Several studies have demonstrated the considerable potential of solutions like OpenAI’s ChatGPT and Microsoft’s GitHub Copilot. These tools can enhance and possibly transform the current software development process[3,4].

For instance, GitHub has published a study showing an increase in software developer productivity of up to 55% in various scenarios through the use of Microsoft GitHub Copilot[2]. Additionally, the study claims that the adoption of generative AI tools can enhance employee satisfaction by shifting focus and effort from monotonous tasks to engineering design and creative work. These results are supported by similar findings from other sources. McKinsey & Company[5] has reported significant time savings for different software development tasks, including a 45–50% reduction in time spent documenting code functionality, a 35–45% reduction in time required for writing new code, and a 30% reduction in time needed for refactoring code.

In terms of total software engineering time, a study from Bain & Company[6] showed savings of between 10% and 15% through the application of generative AI tools, with potential improvements of a further 30% when fully leveraging generative AI capabilities. Similarly, another study by the Massachusetts Institute of Technology[7] revealed the influence of tenure by analyzing productivity measurements from several companies that utilize generative AI tools in their development process. The results indicate that short-tenure developers increased their output by 27% to 39%, while long-tenure developers experienced more marginal gains of 8% to 13%.

However, alongside these promising benefits, a growing body of publications suggests potential negative impacts, such as the ease of generating non-optimal, unnecessary, or excessive code that requires manual improvement. An extensive study analyzing over 200 million lines of changed code from the largest open-source repositories revealed a trend of downward pressure on overall code quality and increased code churn[15] since the introduction of generative AI tools[8]. With respect to programmer productivity, Uplevel[9] found no change in efficiency metrics through GitHub Copilot, while the bug rate increased by 41%. Similarly, a global study in partnership with The Upwork Research Institute concluded that 77% of employees report that AI has increased workloads and hampered productivity[10]. Dakhel et al.[11] studied the influence of tenure on the effective use of GitHub Copilot and observed that the tool can become a liability when used by novice developers, who may fail to filter out buggy or non-optimal solutions due to a lack of expertise.

The referenced publications clearly demonstrate the significant potential of generative AI tools to enhance the entire software development life cycle. However, they also highlight the importance of understanding and mitigating the potential challenges associated with the introduction of these tools. This is particularly critical in the healthcare sector, where software quality is crucial to ensuring patient outcomes and compliance with regulatory, safety, and security requirements. Thus, a deeper understanding of how generative AI tools can be effectively applied within real-world development projects for digital MedTech solutions is needed. This article presents an internal study by ZEISS Digital Innovation, Health and Life Science Solutions that focused on integrating generative AI tools into medical software development projects. It explores both the positive outcomes and the challenges encountered.

Objective and Approach of ZEISS Digital Innovation Study

Objective

Our objective was to empirically investigate how GitHub Copilot can enhance the development of medical software. We aimed to uncover both efficiencies and limitations in this process.

Approach

We evaluated the capabilities of GitHub Copilot through two distinct approaches:

  • Conducting A/B testing on the same proof-of-concept project by comparing two versions: one utilizing GitHub Copilot and the other operating without it.
  • Integrating GitHub Copilot into existing digital medical solution projects that were already in progress, utilizing SAFe agile methodologies and featuring substantial codebases.

An interesting aspect of this setup is that we can follow and monitor the effect of using GitHub Copilot in multiple scenarios: on the one hand, starting from scratch on a new PoC project, which is generally easier for AI code generation tools; on the other hand, diving deep into projects where the development team has been working on a product for months or years and has already established well-formed coding guidelines, frameworks, and complex CI/CD pipelines.

In preparation for the study, we established key metrics that encompassed both quantitative and subjective qualitative measures. Additionally, we were interested in exploring how the results would differ between the two approaches and whether these metric types would correlate with one another. For qualitative analysis, we surveyed involved software developers regarding the tool's effectiveness. The survey included questions about the influence on development efficiency, the quality of code suggestions, areas where the tool performed best, its limitations, primary use cases, and overall user satisfaction.

Regarding quantifying developer performance, we considered established metrics such as the DORA metrics[12] and the SPACE framework[13]. However, these frameworks provide a long-term view of development performance and account for all parts of development activity, not just programming. They are designed for projects with frequent, often daily, deployments to production, which is not typical in the medical sector due to the significant documentation and validation overhead required. Therefore, these metrics were not well suited for assessing development efficiency in this study's context. Instead, we focused on the change in merge frequency, i.e., the average number of pull requests per contributor, as outlined in the DevEx metrics[14]. Additionally, to measure code quality, we collected the well-established metrics cyclomatic complexity, depth of inheritance, and class coupling through static code analysis.
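As an illustration, the merge-frequency metric boils down to dividing the number of merged pull requests by the number of distinct contributors over a period. The sketch below uses a hypothetical, simplified record structure (a list of dicts with an `author` key), not the actual Azure DevOps export format:

```python
from collections import Counter

def average_prs_per_contributor(merged_prs):
    """Merge frequency: average number of merged PRs per contributor.

    `merged_prs` is a list of dicts with an 'author' key -- a
    simplified stand-in for a real pull-request export.
    """
    if not merged_prs:
        return 0.0
    per_author = Counter(pr["author"] for pr in merged_prs)
    return len(merged_prs) / len(per_author)

# Hypothetical sample: 6 merged PRs from 3 contributors
sample = [{"author": a} for a in ["ana", "ben", "ana", "cleo", "ben", "ana"]]
print(average_prs_per_contributor(sample))  # → 2.0
```

Comparing this value for the same team before and after the rollout gives the change in merge frequency used in the study.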

  • The approach is to gather two teams of colleagues of equal seniority with the goal of implementing the same proof-of-concept software project, a basic full-stack customer-relationship-management system. One team tackles the problem with AI assistance from GitHub Copilot and a company-internal chatbot with functionality similar to ChatGPT[16], and the other team without. Before the assessment started, the AI-enabled team was provided with onboarding instructions and a manual on how to use the technology more efficiently but was not specifically trained. The approach therefore simulates fairly standard conditions for a first rollout.

    Key principles for the assessment methodology:

    • A/B testing: Breaking down the testing into two groups, running scenarios both in a standard approach and with AI-based tools. Flip-flop principle between the two groups: one group tackles a given problem with AI assistance and the other without.
    • User training: Provide adequate training or instructions to the users of the tool. This should include an overview of the tool, its features, and how to use it effectively. 
    • Data privacy and security: Ensure that the tool complies with the company's data privacy and security policies. This may involve reviewing the tool's data usage, access controls, and encryption. 
    • Team selection: Identify the right team members to be involved in the pilot program. This includes people who have experience with AI-based tools and those who will be using the tool on a daily basis in the future.

    Before the start of the assessment, use cases and scenarios were designed to ensure a structured evaluation. The study was divided into two primary streams of tasks for both teams: software development and testing. The teams used C# and the .NET framework to implement the backend and database integration, JavaScript and React for the frontend, and Postman for test automation.

    The functional and non-functional requirements for the proof-of-concept project were defined and mapped to an initial backlog of user stories by experienced business analysts, which served as the foundation for the development tasks. The primary objective for the development team was to implement these user stories in software and to write unit tests to verify the functionality of the code. This stream aimed to evaluate the efficacy of GitHub Copilot in enhancing code writing efficiency and accuracy, as well as in generating unit tests that validate the implemented functionality.

    The task for the testing team was to specify test cases for each implemented user story and to implement automated integration and system tests. This stream aimed to assess the impact of ChatGPT-like technologies on the quality and efficiency of test case generation and of GitHub Copilot on the automation of testing processes. By focusing on both integration and system tests, the study aimed to cover a comprehensive range of testing activities that are critical in medical software development.

  • Our expectations were high for this phase of the study, as the project teams are working with cutting-edge technologies, cloud platforms, strict deadlines, already established CI/CD pipelines, and highly scaled SAFe agile framework environments. Three pilot teams were selected that work with rather large code repositories and use MedTech standards including DICOM and HL7. As before, we enabled the tool but did not specifically train the development teams.

    Key principles for the assessment methodology:

    • Minimum number of project teams: To reduce statistical error, we measured multiple projects where GitHub Copilot could be applied; in the end, three agile projects were selected for the tests.
    • Quantitative data: The benefit of piloting on existing projects is that they have historical quantitative data, so we can compare how these data change during and after the GitHub Copilot rollout.
    • Data privacy and security: During the project selection phase, one crucial factor was to select projects with a deeper MedTech scope that do not include company secrets or intellectual-property-related parts.

    In this case, we did not restrict the use cases and scenarios; instead, the entire experiment included a proactive exploration of tasks where our colleagues could effectively utilize GitHub Copilot alongside their established ways of working.

    Ultimately, the overall goal of the study is to address the questions mentioned above, acquire insights from the use of Microsoft GitHub Copilot, identify the specific benefits it offers in projects with MedTech specifics, and assess whether the implementation of this tool should be expanded company-wide.

Key Learnings and Takeaways of ZEISS Digital Innovation Study

We analyzed the impact of Microsoft GitHub Copilot in a couple of scenarios and summarize the findings below.

Figure 1: PoC results

Results from the proof-of-concept

The results of the PoC are illustrated in Figure 1. The data clearly indicates a significant improvement in productivity attributable to the integration of GitHub Copilot and our company chatbot into the development process. Specifically, the study observed time savings of 23% for both development and test automation tasks. Notably, even greater time savings of 38% were recorded for test case creation, underscoring the tool's effectiveness in enhancing testing efficiency.

In addition to productivity gains, the quality of the code was also positively impacted, as evidenced by a decrease in code complexity and class coupling. These improvements suggest that the use of GitHub Copilot not only accelerates the development process but also contributes to the creation of higher-quality software.
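For reference, cyclomatic complexity, one of the metrics we tracked via static code analysis, counts the independent paths through a piece of code: one baseline path plus one per decision point. The study projects used C#, where commercial analyzers compute this metric; the following is only a minimal illustrative sketch of the idea for Python code, with a simplified set of decision-node types:

```python
import ast

# Node types treated as decision points (a simplified rule set;
# real analyzers use more elaborate rules)
_DECISIONS = (ast.If, ast.For, ast.While, ast.ExceptHandler,
              ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _DECISIONS) for node in ast.walk(tree))

code = """
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    return "positive"
"""
print(cyclomatic_complexity(code))  # → 3 (two branches plus the base path)
```

Lower values after a change, as observed in the PoC, indicate code with fewer branching paths and hence easier testing and maintenance.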

While the results are promising, it is important to note that the complexity of the software project was relatively low, and the PoC was conducted as a greenfield project. This raises questions about the extent to which similar benefits can be realized in more complex projects that have been ongoing for several years, which may impose significant constraints on the development process.

Nevertheless, the findings from the PoC provide a strong foundation and the confidence to proceed with further evaluations within real MedTech projects. The insights gained will inform how the PoC results transfer to real-world projects.

Results from development projects

As described in the previous chapters, we analyzed various scenarios, using different data collection methods to obtain an overview of the impact of GitHub Copilot on our projects. To move towards the conclusion, let's summarize the basic statistics of our study. In both cases, namely the A/B testing PoC projects and the ongoing MedTech projects, the development teams comprised the dedicated roles essential for a software development project.

The first step was to gather data on how many team members were actively using the tool and how its usage was distributed among them.

Since GitHub Copilot can be utilized in a range of scenarios and use cases, our objective in this step was to gain a clear understanding of the exact use cases the tool covers effectively and, as an outcome, to identify the areas where ZDI needs to concentrate more effort on AI tool empowerment in the software development lifecycle.

During our interview sessions, we aimed to understand the key limitations and benefits of GitHub Copilot. As an intermediate result, we have compiled some of the most common feedback from end-users.


    • Try to use the quick fixes suggested by GitHub Copilot to get an idea but always double check the quality of the answer.
    • It's a really great tool if it is used in the right way.
    • For prototyping, repetitive tasks and code chunks, or on basic "example" level codes it helps a lot.
    • Be a bit critical about the results, but it is great to get started on tasks.
    • Be very specific with your prompts and narrow them down to the current block of code you are working with.
    • It can help with smaller problems, but it won't solve really complex issues for you.
    • Code suggestions were completely wrong; the generated unit tests used a completely different framework than the one we had. In my daily duties it helped with nothing.

Following a comprehensive summary of the feedback, we conducted an in-depth analysis of both positive and negative feedback. We were particularly interested in the negative feedback to understand the specific challenges where GitHub Copilot fell short in providing quality solutions.

We identified two distinct clusters: most of the positive feedback originated from the PoC projects, while nearly all negative feedback was associated with the ongoing projects featuring large codebases and well-defined processes. Specific scenarios where GitHub Copilot consistently provided incorrect solutions included handling DICOM data, working with health-industry-specific libraries and standards, and utilizing newer features of programming languages or frameworks. In essence, GitHub Copilot underperformed in areas where the example datasets and technologies are not widely spread and where tasks require specialized expertise in the health industry, presumably due to a lack of available training data.

Results from cloud development projects

In the final phase of our experiment, we examined the ongoing cloud software development projects to gather quantitative data on how Project Managers perceive GitHub Copilot, based on numerical metrics from their projects. Our approach involved analyzing data from Azure DevOps, static code analysis, and software composition analysis across various categories. These categories included Git code changes, work item lead times, and additional statistics such as Pull Request metrics, CI/CD pipeline performance, code quality variations, risk and vulnerability assessments, along with several other factors.

Building on these plans, it is important to acknowledge the challenges we faced in measuring the data due to the inherent peaks and releases associated with software development cycles. To ensure the integrity of our analysis, we meticulously cleaned the dataset by removing extreme values that could skew our results. However, it is worth noting that our measurement period was relatively short, spanning just three to four months per project. While the data indicated some level of improvement, it did not align as closely with the positive ratings provided by colleagues in their feedback forms. Additionally, the experiment was conducted in real-world scenarios rather than controlled lab environments, which introduced various fluctuations due to factors such as vacations and illnesses. These elements may have contributed to the discrepancies observed between the quantitative data and qualitative feedback.
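The exact cleaning procedure is not detailed here; one common way to remove extreme values of the kind caused by release peaks, shown purely as an assumed illustration rather than our actual pipeline, is the interquartile-range (Tukey) rule:

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical weekly merged-PR counts with one release-week spike
weekly_prs = [12, 14, 11, 13, 15, 12, 48, 14]
print(remove_outliers_iqr(weekly_prs))  # the 48 spike is removed
```

Whatever the exact rule, removing such spikes before comparing pre- and post-rollout periods keeps a single release crunch from dominating the averages.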

Overall results based on qualitative assessment

Below you will find a summary of the qualitative feedback gathered during our study. We utilized a star rating system ranging from 1 to 5, where 1 represents the lowest rating and 5 represents the highest. Our objective was to assess how our colleagues perceived the quality of the code suggestions provided by GitHub Copilot. The following chart illustrates the ratings collected, highlighting the varying opinions on the effectiveness, trust, and overall experience of the tool.

As the name implies, GitHub Copilot serves as an assistant to the developer, as a copilot aiding a pilot in their tasks. We measured the extent to which our developers trust the suggested solutions, and it appears they maintain a more critical perspective: the overall trust in the tool is somewhat lower than the perceived quality of the suggestions provided.

Our study highlights the significant role of GitHub Copilot in enhancing productivity and code quality within MedTech software development. The tool was frequently used for code and test generation, indicating its utility in automating routine tasks. However, feedback revealed limitations in complex scenarios requiring specialized domain knowledge, such as handling DICOM data and health-specific libraries. Positive experiences were mainly associated with proof-of-concept projects, while ongoing projects with established processes reported challenges. Overall, the results not only affirm the potential of GitHub Copilot to enhance productivity and code quality but also emphasize the importance of context in evaluating its effectiveness.

Conclusion

In conclusion, our study has demonstrated that the implementation of Microsoft GitHub Copilot can lead to a

  • 10–15% overall performance gain per developer

The tool performs better on simpler tasks, such as writing unit tests, generating code snippets, and creating prototypes. The tool's knowledge base allows it to enhance productivity in these areas, making it a valuable aid for our development teams. However, it is important to note that GitHub Copilot encounters challenges when dealing with less commonly used standards or problems, such as DICOM, HL7, and SDC, where its assistance is limited or, in some scenarios, not even possible.

The cost-benefit analysis and the positive outcomes observed in our projects have led ZEISS Digital Innovation to expand the rollout of Microsoft GitHub Copilot across the organization. By leveraging the tool's strengths while acknowledging its limitations, we aim to further enhance our development processes and foster innovation in the MedTech sector. This strategic decision reflects our commitment to adopting advanced technologies that can drive efficiency and improve the quality of our software solutions.

Please reach out if you are interested in more details about our internal studies regarding generative AI in medical software development!

Author Dr. Julian Massing

Senior Solution Specialist
ZEISS Digital Innovation Health & Life Science Solutions

  


Author David Klusoczki

Senior Business Analyst
ZEISS Digital Innovation Health & Life Science Solutions

  

