From vision to value: How AI is taking shape in mortgage
For mortgage executives, the challenge with AI is no longer whether to invest, but how to scale responsibly and effectively. Join Tela Gallagher Mathias of PhoenixTeam and Chris McEntee of ICE Mortgage Technology for a strategic discussion on ICE’s AI execution and the broader state of AI adoption across the industry. The webinar will highlight where lenders and servicers are realizing value, the risks that often emerge during execution and the strategic decisions that shape long-term success.
What you’ll learn:
Mortgage leaders are past the question of whether AI matters. The real challenge now is how to deploy it in ways that scale, manage risk and deliver measurable value. This webinar offers a grounded, strategic look at what that takes.
Viewers will gain insight into how ICE and PhoenixTeam are approaching AI across the mortgage lifecycle and hear an independent view of how lenders and servicers are adopting AI in practice today. Drawing from real-world examples, the discussion will highlight where AI initiatives are delivering results, where they commonly break down and what leadership teams need to consider as they move from experimentation to execution.
Introducing Large Language Models to Traditional Machine Learning Operations
Machine learning (ML) operations (MLOps) is the set of practices, tools, and associated culture that bridges the gap between building ML models and running them reliably in production. MLOps is as much about science and engineering as it is about systems thinking. Some of the learnings in this article come from a session at GTC focused on explaining why MLOps is needed, and what changes with MLOps when generative technologies are introduced.
A traditional MLOps approach starts with problem definition and culminates with continuous monitoring and validation, like most (good) product development lifecycles.
Why does MLOps matter?
Let’s start with why this matters. Especially in financial services, we have to be able to answer at least four key questions:
What is this model doing and where is it in production?
What was it trained and validated on, and is production still consistent with that?
Who approved it, who owns it now, and what changed since approval?
Is it still performing safely, fairly, and within policy, and what happens if it does not?
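One way to make those four questions answerable on demand is to keep a small record for every deployed model. The sketch below is only an illustration of that idea: the `ModelRecord` class, its field names, and the wording of the checks are all hypothetical and not taken from any particular registry product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ModelRecord:
    """Hypothetical inventory entry; every field name here is illustrative."""
    name: str
    purpose: str                 # what the model is doing
    environments: list           # where it is running, e.g. ["prod"]
    training_data_ref: str       # what it was trained and validated on
    approved_by: str
    approved_on: date
    owner: str
    changes_since_approval: list = field(default_factory=list)
    last_monitoring_check: Optional[date] = None
    fallback_plan: str = ""      # what happens if it stops performing

    def items_needing_attention(self) -> list:
        """Return the governance questions this record cannot settle yet."""
        gaps = []
        if not self.environments:
            gaps.append("where is it in production?")
        if not self.training_data_ref:
            gaps.append("what was it trained and validated on?")
        if self.changes_since_approval:
            gaps.append("what changed since approval?")
        if self.last_monitoring_check is None:
            gaps.append("is it still performing within policy?")
        if not self.fallback_plan:
            gaps.append("what happens if it does not?")
        return gaps
```

A record that comes back with an empty list is not proof of a healthy model, but a non-empty list is a quick signal that one of the four questions has no owner.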
MLOps is at the heart of many of these questions. Model management is not new to us in mortgage and has significantly evolved from the early days of automated underwriting and securitization. With the financial crisis of 2008, model risk became increasingly central to fair lending, risk assumptions, and stress testing. The introduction of generative models into our ecosystem only increases the importance of good operational practices around this new kind of ML system, hence the idea of MLOps as central to the new AI-powered mortgage ecosystem.
Famous Examples of MLOps Gone Wrong
There are many examples of MLOps "gone wrong", or cases where an engineering operations focus could have created better outcomes. A few examples mentioned at the NVIDIA conference:
Ariane 5 Flight 501 (1996): Ariane 5 was a European rocket, and on its first launch it went off course and had to be destroyed less than a minute after takeoff. Part of its guidance software accepted a value larger than it was built to handle, causing the system to fail at a critical moment. The MLOps lesson is that a system can break fast when real-world conditions go beyond what the system was designed and tested for.
Mars Climate Orbiter (1999): NASA intended for the Mars Climate Orbiter to, well, actually orbit Mars, but instead it burned up in the Martian atmosphere. One part of the system was using English measurement units and another was expecting metric units. The numbers appeared valid but in execution meant different things. The MLOps lesson is that a system can fail when different teams or tools are not using the same measurement rules.
Millennium Bridge (2000): Under pedestrian traffic shortly after it opened, London's Millennium Bridge swayed side to side much more than expected and had to be closed for safety reasons. Engineers found that once the bridge began moving under foot traffic, people naturally adjusted how they walked, and the resulting lateral forces unintentionally made the side-to-side motion worse. The MLOps lesson is that a system can seem fine in design and testing, then behave very differently once real people interact with it at scale.
Knight Capital (2012): A bad software rollout caused Knight Capital's core financial system to initiate huge numbers of unintended trades. During the first 45 minutes of trading, the system turned 212 customer orders into more than four million orders and led to more than $460 million in losses. This demonstrated that a system can fail not because the idea is wrong, but because a bad production rollout can make the live system behave in a completely different way than intended. The MLOps connection here is less about machine learning itself and more about how you safely release and control live decision systems.
Failure Through an MLOps Lens
It was helpful for me to see this diagram explained within the context of the spectacular failures as explained by serious MLOps engineers.
Looking at a robust MLOps framework like the one used at NVIDIA, we can pinpoint the failure points.
Ariane 5 failed because the live system encountered values outside what the software could handle. In this framework, that should have been caught by defining the operating limits early and then testing the system against extreme but plausible conditions during validation and simulation.
Mars was really an interface meaning problem: the numbers were there, but one side meant English units and the other meant metric. In this framework, that belongs first in data federation, cleaning, and labeling, where data definitions and contracts should be aligned, and then in validation and simulation, where those handoffs should be tested before deployment.
The bridge looked fine until real people started walking on it, which created a feedback loop nobody had fully accounted for. In this framework, that means you need stronger validation and simulation before launch, but also continuous monitoring and validation after release because some behaviors only appear in the real world at scale.
Knight Capital was mainly a bad production rollout problem. In this framework, the strongest controls should have been in production deployment, making sure the release was consistent and safe, and in continuous monitoring and validation, so the issue was detected and stopped immediately.
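The first two failures above come down to inputs that were never forced to declare their meaning or their limits. The sketch below shows the general idea under stated assumptions: the `Impulse` type and `to_int16_checked` function are hypothetical names, and the choice of pound-force seconds versus newton-seconds and of a 16-bit signed integer reflects the publicly reported details of the Mars Climate Orbiter and Ariane 5 incidents rather than anything shown in the session.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # pound-force seconds to newton-seconds


@dataclass(frozen=True)
class Impulse:
    """A value that carries its unit, so a handoff cannot silently change meaning."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        # An unlabeled or unknown unit is rejected rather than guessed at.
        raise ValueError(f"unknown unit: {self.unit!r}")


def to_int16_checked(x: float) -> int:
    """Convert to a 16-bit signed integer, refusing values outside its range."""
    if not -32768 <= x <= 32767:
        raise OverflowError(f"{x} is outside the 16-bit operating limit")
    return int(x)
```

Neither check is sophisticated. The point is that the data contract and the operating limit are written down in code, where validation and simulation can exercise them before deployment.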
How MLOps Changes with Generative Models
Generative models are simply a new and more complicated model to manage. Even a simple retrieval augmented generation (RAG) bot is actually a relatively complicated MLOps system.
We still start with problem definition; that hasn't changed. And we still have to make our data "ready" for use in retrieval. This data, which will be used to enrich and further contextualize the base knowledge of the large language model, has to be converted to vector embeddings, which requires chunking, tokenization, and use of an embedding model. All of these are steps in an MLOps pipeline flow. We should probably also store the original natural language data, yet another thing to manage. And don't forget change: how we govern the process of updating our embeddings is kind of a big deal. More MLOps.
We have this new problem of prompt management and evaluation, which requires golden standard data sets with both the questions and the answers (also called QA pairs for question and answer). It also requires us to have a good natural language versioning solution, again because the problem of change and optimization is a big deal. Prompts are a significant asset to understand, version, evaluate, test, optimize and control.
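To make the "govern the process of updating our embeddings" point concrete, here is a minimal sketch of the bookkeeping side of a RAG data pipeline. It does not call any embedding model. The function names, the character-based chunk sizes, and the `embedding_model` label are all assumptions for illustration, and tokenization is left to whatever embedding model is actually used.

```python
import hashlib


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def build_manifest(doc_id: str, text: str, embedding_model: str) -> list:
    """Describe each chunk so a later run can tell what needs re-embedding."""
    return [
        {
            "doc_id": doc_id,
            "chunk_index": i,
            "content_sha256": hashlib.sha256(chunk.encode("utf-8")).hexdigest(),
            "embedding_model": embedding_model,  # hypothetical model label
        }
        for i, chunk in enumerate(chunk_text(text))
    ]


def needs_reembedding(old: dict, new: dict) -> bool:
    """A chunk is stale if its text or the embedding model changed."""
    return (old["content_sha256"] != new["content_sha256"]
            or old["embedding_model"] != new["embedding_model"])
```

Storing a manifest like this next to the vectors is one way to answer "which embeddings are out of date?" without re-embedding everything on every change.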
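A small sketch of what prompt versioning plus golden-set evaluation can look like follows. Everything in it is hypothetical: the hash-based version id is just one simple option, `fake_model` is a stand-in for a real model call, and exact-match scoring is far cruder than what a real evaluation would use.

```python
import hashlib
from typing import Callable


def prompt_version(template: str) -> str:
    """Derive a short, stable version id from the prompt text itself."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def evaluate(template: str, golden_pairs: list, ask_model: Callable[[str], str]) -> dict:
    """Score one prompt version against golden (question, answer) pairs."""
    correct = 0
    for question, expected in golden_pairs:
        answer = ask_model(template.format(question=question))
        # Exact match after normalisation; real evaluations usually need a
        # softer comparison, which is out of scope for this sketch.
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return {
        "prompt_version": prompt_version(template),
        "total": len(golden_pairs),
        "correct": correct,
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; it only knows one answer."""
    return "4" if "2 + 2" in prompt else "unknown"
```

Because the version id is derived from the text, any edit to the prompt, even a single space, produces a new id, so evaluation results can always be tied back to the exact wording that produced them.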
And of course model management, not exactly new in a generative scenario, but perhaps more fluid. Even if we are not training foundation models, we still have to preserve optionality, adapt the behavior of the model (perhaps going so far as to fine tune it), and customize how it behaves under real workloads and in the real world. More work with the intersections of prompting, data, and evaluation. Ensuring the model performs with edge cases, unanticipated scenarios, and over time (detecting and addressing model drift) would all be the purview of MLOps.
Although not expressly called out in our diagram above, latency management seems more significant with the addition of generative technologies. Users have come to expect instant answers, regardless of how complicated the request is. We have to think about when to offload workloads to an async process, when to use streaming so users can at least see the answer as it builds, and when to parallelize.
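Two of those latency tactics, running independent work concurrently and streaming the answer as it builds, can be sketched with nothing more than Python's standard `asyncio` module. The `retrieve` function and its source names are hypothetical, and the short sleeps only stand in for real network calls.

```python
import asyncio
from typing import AsyncIterator


async def retrieve(source: str) -> str:
    """Hypothetical retrieval call; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return f"context from {source}"


async def gather_context(sources: list) -> list:
    """Run independent retrievals concurrently instead of one after another."""
    return list(await asyncio.gather(*(retrieve(s) for s in sources)))


async def stream_answer(words: list) -> AsyncIterator[str]:
    """Yield the answer piece by piece so the user sees it as it builds."""
    for word in words:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def main() -> str:
    context = await gather_context(["policy docs", "prior cases"])
    pieces = [piece async for piece in stream_answer(context[0].split())]
    return " ".join(pieces)
```

Calling `asyncio.run(main())` runs both retrievals at the same time and then streams the first result word by word. In a real system the consumer of `stream_answer` would push each piece to the user interface instead of collecting them into a list.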
Finally guardrails, again not exactly new to traditional MLOps, but something acutely important for understanding, controlling, and evidencing an AI system. Proving the responsible implementation, demonstrating safety, evidencing harm prevention - all essential to managing a good AI system, and all the purview of MLOps if you want it done well.
By Tela Mathias, CTO & Chief Nerd and Mad Scientist at PhoenixTeam
GTC Insights | Visual AI Agents for Real Time Video Understanding
Vision AI is a not-new-but-hotter-than-it-used-to-be capability. The thing that changed is the opportunity in physical AI, as physical AI gets increasingly real. As a refresher, vision AI is AI that can look at images or video and understand what is in them, while physical AI does not just understand information – it can sense, reason, and act in the real world through machines like robots, cars, drones, or factory systems. Vision AI might see a box, but physical AI can see the box, decide what to do with it, and then pick it up or move around it.
The Problems Working with Video
Low search accuracy is a problem with traditional video searching. This is because traditional search is limited to trained attributes. With a single embedding model, we can move search from retrieval based on trained attributes to generative. Traditional approaches might fire alerts when a triggering event happens in a video (think a dog walks by your security camera and you get an alert). This is great, except we all get “alert fatigue” amirite? At first you look at every alert and then, after a while, you look at nothing. It is difficult for a human to filter the important alerts (sketchy looking human passing the video camera in the middle of the night) from the unimportant (squirrel!). We want to find the true positives, things that need to be escalated, and things where we need to act. It’s the classic needle in a haystack problem.
What is Video Search and Summarization (VSS)?
Video search and summarization (VSS) is a set of vision aware tools that can connect to agents so the agents can understand what they are seeing. Think about tools for decomposing, searching, retrieving, critiquing, and summarizing context of a bunch of video. VSS is a hard problem – think about what you would have to do to summarize a video – you’d have to find it, watch it, figure out what’s important, maybe watch it again, pick out some highlights, summarize it, and then check your work. That’s what a VSS does. And it can do that for a one-hour video in about six minutes. Wow.
We will see this capability enter mortgage in policy procedure generation, job redesign, and opportunity analysis.
The Value Proposition
The KPIs here were pretty strong: 80% quicker onboarding for a training company, 80% reduction in incident reporting fatigue in a manufacturing company, and 95% cost reduction for a training company. Let’s take an example. Imagine you want to find all the places in a soccer game where a particular person scored a goal. You know how you would do that and can estimate how long it would take. If the game is an hour, it would take… at least an hour.
This will be a little like the Ronco rotisserie - set it and forget it. Get the pipeline going and continuously feed it data. Wake up the next day and see what it found. Amazing.
Now imagine that you could process the video in six minutes and then ask natural language questions and get verified evidence of every time that person scored a goal. All in about six minutes to process the video, and two seconds to answer the questions (with evidence!). That’s literally a 90% performance gain, conservatively.
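The 90% figure is just the arithmetic on the numbers above: a one-hour video watched manually (60 minutes) against about six minutes of processing. A tiny sketch, with hypothetical function names, makes the calculation explicit.

```python
def time_reduction(baseline_minutes: float, new_minutes: float) -> float:
    """Fraction of the baseline time that is saved."""
    return (baseline_minutes - new_minutes) / baseline_minutes


def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """How many times faster the new process is."""
    return baseline_minutes / new_minutes
```

With `time_reduction(60, 6)` the saving works out to nine tenths of the original time, and `speedup(60, 6)` gives a factor of ten. Adding the two seconds needed to answer a question barely moves either number.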
The Technical Solution Architecture
It won’t surprise you by now that “there’s an NVIDIA solution for that” and it’s their VSS blueprint. NVIDIA VSS turns video from passive footage into an AI-readable system you can search, summarize, and act on. It is interesting because it treats video as a first-class enterprise data source, combining computer vision, VLMs, LLMs, and RAG so organizations can move from watching footage to querying and operationalizing it.
I know it looks a little overwhelming, but it really isn't. Just read the icons top to bottom, left to right.
GTC Insights | Open Source and Agentic Development
I vividly remember last year having a light bulb moment when Jensen provided his definition of an AI agent. An AI agent is an AI that can perceive, plan, reason, and act. For some reason, it took me hearing it from Jensen in context to really get it. I attended multiple sessions at GTC this year on a family of open source models designed specifically for agentic development, which deepened my understanding of accelerated computing, what it means to put AI “at the edge”, how a generative model is trained, and what agentic development really is.
Approximating Human Agency
Perceive, plan, reason, and act. As humans, we do these things intuitively and without thinking. We just know how to do it. Our ability to perceive our environment relies on vision, hearing, our sense of touch, and our ability to translate this sensory information into meaning. Planning requires us to have executive functioning. This higher order mental ability allows us to make sense of data and organize it into a logical framework. As we reason through a problem, we consider alternatives and think back on prior experiences. And when we act, we intersect our inner world with the physical world. We use tools. We apply skills we have honed over time and based on our experience.
This intuitive set of human processes relies on multiple complicated systems. It takes a system of systems to make this possible. This is really the essence of agentic development – how to create a system of systems that proactively work together to meet our objectives without being explicitly told how.
I want you to really think about how you do what you do and then put that in the context of how agents work.
Let's look at an example with a deep research agentic system. NVIDIA provides solution blueprints, models to implement, and packaged AI inference microservices. (They call these NIM – NVIDIA Inference Microservices). You may want to ask your vendor partners if they are using these blueprints; they really are a wealth of actionable information and tools. They also memorialize industry best practices.
This is a zoomed out view of the "AIQ" blueprint from NVIDIA. It's an agentic system for deep research.
To put this in context, think about a set of helpdesk deep research agents in a call center. There could be an intake agent, a triage agent, an escalation agent, and a prior cases agent. All of these are deep research agents. These four agents would each be designed with skills and tools. They are small and contained to maximize their effectiveness and prevent the agent from “wandering off”.
For an agent to be effective, it must have authority, which makes the agent more difficult to secure. This is called the agent paradox. The default pattern in good agent design is to DENY permissions, and ALLOW only as a thoughtful, secured choice.
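The deny-by-default pattern is easy to express in code. The sketch below is a minimal illustration only: the `ToolPermissions` class and the agent and tool names are hypothetical, and a production system would add authentication, auditing, and scoping that are not shown here.

```python
class ToolPermissions:
    """Deny-by-default allowlist for an agent's tool calls (illustrative only)."""

    def __init__(self) -> None:
        self._allowed = {}  # agent name -> set of tool names

    def allow(self, agent: str, tool: str) -> None:
        """Grant one tool to one agent as an explicit decision."""
        self._allowed.setdefault(agent, set()).add(tool)

    def is_allowed(self, agent: str, tool: str) -> bool:
        """Anything not explicitly granted is denied."""
        return tool in self._allowed.get(agent, set())

    def call(self, agent: str, tool: str, action):
        """Run the action only if this agent was granted this tool."""
        if not self.is_allowed(agent, tool):
            raise PermissionError(f"{agent} may not use {tool}")
        return action()
```

For example, after `perms.allow("triage_agent", "read_ticket")`, the triage agent can read tickets but any other tool, and any other agent, is still refused until someone makes a deliberate choice to allow it.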
Human Analogies for Open Source, Accelerated Computing, and the Edge
You can think of a human as a closed source model. We are opaque by default. I perceive a set of inputs and take an action as a result. Even if I explain how I did it, the true technical mechanisms are largely closed to other humans. Even a neurologist only has a very selective view into the complex organic relationships that make human agency possible.
Only took four iterations to get NotebookLM to give me this version. It came out ok, not perfect but it will do.
But what if our inner world was transparent? What if we could break ourselves down and inspect the way we do things? What if we could look at all the experiences we had over a lifetime that caused us to make a particular decision? What if we could double click on each other? Think of that as open source. If we could understand the pieces and parts, we could identify weaknesses, improve results, remove bottlenecks. We could make systems work more effectively, faster. Think of that as accelerated computing. And if we could do that not just with internal processes, but within the context of physical reality, we could also impact the world. You can think of that as “the edge” – that place in human reality that intersects with physical reality. (I'm definitely not saying I would want open source people, by the way. Just trying to create an analogy).
Open Source Model Development
I had not realized that NVIDIA was in the model making business. NVIDIA Nemotron is a family of open-source foundation models and datasets designed to build and deploy agentic AI systems. Now that we understand how open source allows us to better understand and optimize how things work, we can start to see why NVIDIA might want to build models and expose them to others.
Building models allows them to anticipate what their clients will need by experiencing those needs firsthand. Designing, pre-training, training, post-training, deploying, and supporting a family of models requires a kind of learning that can only be gained by doing. It required NVIDIA to become experts at what is needed to design the requisite AI infrastructure and to accelerate the ecosystem around that infrastructure. Knowing what to build and having a perspective on what needs to be built. Frankly, this is one of the many reasons we build products at PhoenixTeam, and why I spend weekends and early hours tinkering.
Making models open source invites others into the process who likely see things differently and can make different contributions to the process. The OpenClaw moment was a shining example of the impact a small, community project can have. It showed us how one small project can create a robust and diverse ecosystem (that requires NVIDIA solutions, of course). Hence the rationale for making open source models.
What is the significance of data in this equation, and the fact that NVIDIA released the data? They found that 75% of the compute required during the process is actually spent on synthetic test data generation and running experiments. Remember, we are running out of data, and data scarcity for training is a big deal.
Base model development (pretraining) is all about KNOWLEDGE and post-training is all about BEHAVIOR and ensuring the model behaves in the way "we think it should".
There are four basic stages to training a generative model, and Nemotron as a family of models is really an approach. In addition to models, NVIDIA has also released the training and fine tuning data and the techniques. This goes with the model family. This model family includes everything you need to design and deploy agents, most of which can run locally if you wish (except the largest model in the family, which is 253B parameters and is too big to run on a DGX Spark). But it also has smaller, specialized models for reasoning, speech, and RAG. I think my new DGX Spark comes Nemotron ready.
Accelerated Computing
Accelerated computing is this idea that you can take certain parts of a system workload and offload them to accelerated processing that uses a different type of processor (the GPU – graphics processing unit). Accelerated computing speeds up the right workloads enormously, but if you apply it to the wrong things, it can slow most things down. That is why Jensen talks so much about identifying the parts of the stack that are truly “accelerable” instead of assuming every workload belongs on a GPU. Wasted effort makes AI that is dumber. Think of accelerated computing as steps in a chain, and the goal is to accelerate as many of those steps as possible. So, the data set used for training is a link in the chain of acceleration.
Accelerated computing slows most things down. This really resonates with me. The solution to all problems is not a large language model. Every problem doesn’t need an agent. Accelerated computing has a point of view, focus, and specialization. So the idea is to identify what computations in the future will be important and can/should be accelerated. Think beyond the GPU and stick with first principles. Preach!
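The "accelerate the right links in the chain" idea has a well-known formula behind it, Amdahl's law: if only a fraction of the runtime can be accelerated, the rest caps the overall gain. The sketch below adds one assumption of my own that is not part of the classic formula, a fixed overhead term for moving work to the accelerator, to show how offloading the wrong workload can make things slower. The function and parameter names are hypothetical.

```python
def overall_speedup(accelerable_fraction: float,
                    acceleration: float,
                    overhead_fraction: float = 0.0) -> float:
    """Amdahl's law with an optional fixed offload overhead.

    All fractions are relative to the original runtime. The overhead term is
    an illustrative assumption, not part of the classic formula.
    """
    serial = 1.0 - accelerable_fraction
    new_time = serial + accelerable_fraction / acceleration + overhead_fraction
    return 1.0 / new_time
```

If 90% of a workload is accelerable and that part runs ten times faster, `overall_speedup(0.9, 10.0)` comes out a little above five times, not ten. If only 5% is accelerable and the offload costs 10% of the original runtime, `overall_speedup(0.05, 10.0, overhead_fraction=0.1)` drops below one, meaning the "accelerated" system is slower than the original.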
You can design for acceleration, which is what they did with Nemotron. It was one of the up front design principles. Nemotron was also designed for mixture of experts (MOE) from the beginning and designed to deal with “numerics” and “sparsity”. To be designed for numerics means NVIDIA designed Nemotron so it works well with the messy realities of GPU math, not just idealized floating-point math on paper. To be designed for sparsity means the models are structured so that only the needed parts fire, and the system is built to make that selective firing efficient instead of wasteful.
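To show what "only the needed parts fire" means in practice, here is a toy version of top-k expert routing, a common way sparse mixture-of-experts layers are described. It is a generic illustration and not Nemotron's actual routing: the function names are hypothetical, the "experts" are plain Python functions on a single number, and the router scores are supplied by hand rather than learned.

```python
import math


def softmax(scores: list) -> list:
    """Turn raw router scores into probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores: list, k: int = 2) -> dict:
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}


def sparse_forward(x: float, experts: list, router_scores: list, k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

With four experts and `k=2`, two of the four functions are never called for a given input. That skipped work is the efficiency the sparsity design is after, provided the hardware and software make the selective firing cheap.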
In a model that is more closed, you can really only alter the behavior through prompt engineering and model fine tuning (and whatever other capabilities the model provider chooses to allow you to have). There can be very inefficient parts of the model, its processes, or its data that simply cannot be altered. Accelerated computing is about making the chain go faster. If parts of the chain are hidden, then you lose that opportunity for acceleration.
Open Source at the Edge
I never really understood what the edge even was until last year. And I really didn't get the so-called "internet of things (IoT)". The edge is where digital meets our world. A scanner in a grocery store. A sensor on a shelf in a warehouse. To put generative capabilities on "the edge" is to deploy models that are small enough to be practical and also smart enough to be worth it. That's a serious engineering problem.
This is what makes my new DGX Spark so exciting. Some of you will remember my journey to deploy a 70 billion parameter model on my massive Mac Pro. I tried to run a bigger one but my machine did not have sufficient compute. DGX Spark is a small desktop AI computer where you can run, test, and fine-tune large AI models locally. It’s basically a very powerful “AI box for your desk,” built specifically for AI workloads rather than general-purpose computing. And the best part is you can air gap it. I picked one up at the NVIDIA swag store for an amount I will only tell you if you ask me directly. I can say it's fully five times less expensive than my Mac Pro.
I call him Sparky. He will be a safe, secure place to run my OpenClaw agents and take my tinkering to the next level.
I also learned about something I want to look into for our commercial product, which is Llama Nemotron Nano VL. I had to do a bit of research on this, but they described it as a “tiny open VLM that rivals closed models in doc extraction” (VLM is vision language model). I want to try this one out for sure.
Open source at the edge means more opportunity to accelerate. Given that the edge requires seriously compact and optimized solutions, you can see that being able to engineer the inner parts of the model could be the difference that's needed to make edge computing feasible.
Open Source as a Component of a Mixture of Experts (MOE) Architecture
Model selection is not an “or” but an “and”. It’s not open or closed, this model or the other, it’s open AND closed, this model AND these other models. We are creating or at least using systems of models. When you use any of the (closed) foundation models, you are really using multiple models to do different things – that complexity, however, is abstracted from that interaction. That is part of the value of working with a closed foundation model.
However, it’s not one size fits all. More and more (I’d say approaching table stakes), we are seeing “mixture of experts (MOE)” approaches where we use different models to do different things, in the same overall system. Maybe a VLM for document analysis, Claude for general knowledge, and a domain specific model to solve a particular type of problem. We are seeing more and more usefulness of this model when we have specialized AI solutions (i.e. a general model isn’t the best) or when we are looking for efficiency (again, a general model is not terrifically efficient).
There are also MOE approaches where you use different models for different problems, as opposed to this pattern where the models are used to come up with a composite score.
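The two patterns contrasted here, sending each problem to a different model versus combining several models into one composite score, can be sketched in a few lines. The function names, the task labels, and the idea that every model returns a single numeric score are assumptions made purely for illustration.

```python
from statistics import mean
from typing import Callable

Model = Callable[[str], float]  # hypothetical: each model returns a score


def route(task_type: str, payload: str, models: dict) -> float:
    """Pattern 1: send each kind of problem to the one model suited to it."""
    if task_type not in models:
        raise KeyError(f"no model registered for {task_type!r}")
    return models[task_type](payload)


def composite_score(payload: str, models: list) -> float:
    """Pattern 2: ask every model and combine the results into one score."""
    return mean(model(payload) for model in models)
```

Routing calls one model per request, which is where the efficiency argument comes from. The composite approach calls them all, trading cost for a blended view.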
About the data – the most valuable data in an organization ALWAYS has the most restrictions. And the argument for open here is that we need more diversity to optimize for unique and specialized data and domain needs. We simply may not trust a closed model whose inner workings we do not understand, whose training data we have no visibility into, to handle our most valuable data.
Nemotron Performance Engineering
For the more technical readers among you, Nemotron is a pretty solid model family according to all the benchmarks. Here is how this was achieved:
Hybrid transformer architecture – A hybrid transformer architecture means the AI model uses more than one kind of “thinking part” instead of just the standard transformer setup. In Nemotron, that helps it handle long inputs and run more efficiently while still staying strong at reasoning and language tasks.
Multi-token prediction (take advantage of free tokens) – multi-token prediction helps the model look a few words ahead, and when those guesses check out, you get more output with less waiting. They are not literally free in money terms, they are “free” in the sense of extra tokens for almost the same decoding work.
1M context length – this means the model can keep about 1 million tokens of text in its working memory at one time (~750,000 words). The model can reason across very long documents, large codebases, long chat histories, or many retrieved documents at once instead of chopping everything into lots of smaller pieces.
Latent MOE – latent means a smaller internal representation of the token, not the full version the model normally uses. In latent MOE, the model compresses the token into the latent space, does the work, then projects it back up, saving bandwidth and compute.
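The "when those guesses check out" part of multi-token prediction can be illustrated with a toy acceptance rule: keep the drafted tokens up to the first one the verification step disagrees with. This is a simplified sketch of the general draft-and-verify idea, not NVIDIA's implementation, and the `accept_draft` name and the example token lists are hypothetical.

```python
def accept_draft(draft: list, verified: list) -> list:
    """Keep the leading run of drafted tokens that verification agrees with."""
    accepted = []
    for guess, truth in zip(draft, verified):
        if guess != truth:
            break  # everything after the first miss is discarded
        accepted.append(guess)
    return accepted
```

If the draft is `["the", "loan", "was", "denied"]` and verification produces `["the", "loan", "is", "approved"]`, the first two tokens are accepted. Those are the "free" tokens: extra output for almost the same decoding work.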
It’s fastest on PinchBench and has frontier level accuracy. PinchBench is a benchmark for testing how well language models perform as OpenClaw agents. Seems worth trying out for me. DGX Spark here I come.