Quick note on the AI guidance.. all five of the models listed score better than GFS in testing. They are trained on many years of reanalysis data, then tested on a held-out set of years (usually 2020 through present). Evaluation is usually done with a metric called RMSE, which is also the metric used in training. RMSE/MSE really hammers a model for drastically wrong forecasts but is far more forgiving of forecasts that are only slightly wrong. This essentially leads to most AI models being very accurate synoptically while underestimating mesoscale extrema: the metric punishes models for taking risks, and leaning into extrema is a risk for the model.
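To make that scoring incentive concrete, here's a toy numpy sketch (all numbers synthetic, not from any real model) showing why a forecast that hedges toward the mean beats one that commits to realistic amplitude under RMSE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Truth = a predictable synoptic signal plus mesoscale detail the model
# can't resolve. The model only "knows" the signal.
signal = rng.normal(0.0, 1.0, n)
noise = rng.normal(0.0, 1.0, n)
truth = signal + noise

def rmse(fc):
    return np.sqrt(np.mean((fc - truth) ** 2))

# Option 1: forecast the conditional mean (hedged, smooth).
mean_fc = signal

# Option 2: "lean into extrema" -- inflate amplitude so the forecast's
# variance matches the truth's variance.
sharp_fc = signal * np.sqrt(2.0)

print(f"hedged RMSE: {rmse(mean_fc):.3f}")   # ~1.00
print(f"sharp  RMSE: {rmse(sharp_fc):.3f}")  # ~1.08, worse despite realistic amplitude
```

The hedged forecast wins every time, because the MSE-optimal point forecast is the conditional mean. A model trained on this loss learns exactly that behavior: smooth fields, damped extremes.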
So, typically, you'll want to use these models the way you'd use an ensemble mean. Temps will be nudged slightly toward seasonal averages.. MSLP will tend toward non-extreme lows/highs.. etc. The real benefit of these models is the spatial interpretation of synoptic-scale features... meaning cyclone tracks, large-scale setups, etc.
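As a concrete example of trusting the spatial interpretation over the magnitudes, here's a minimal sketch of pulling a cyclone center out of a gridded MSLP field (the `forecast["mslp"]` access pattern is hypothetical; adapt it to however your data source exposes the grid):

```python
import numpy as np

def cyclone_center(mslp, lats, lons):
    """Locate the lowest MSLP grid point -- a crude cyclone-center proxy.

    mslp: 2D array shaped (lat, lon); lats/lons: 1D coordinate arrays.
    """
    j, i = np.unravel_index(np.argmin(mslp), mslp.shape)
    return lats[j], lons[i], mslp[j, i]

# Hypothetical usage with a model's gridded output:
# lat, lon, p_min = cyclone_center(forecast["mslp"], forecast["lat"], forecast["lon"])
# Repeat per lead time to build a track. Trust the *position* more than
# the depth of p_min, since an RMSE-trained model likely under-deepens the low.
```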
With all of that said, these models still score better than GFS in testing. GraphCast, AIFS, and Aurora score better than the ECMWF operational run and very close to the EPS (sometimes exceeding it).. Pangu scores on par with ECMWF op.. FourCastNet generally scores worse than ECMWF op (but better than GFS). These models are new, and it can be overwhelming to see so many of them, but they are certainly the future of meteorology. They are incredibly cheap to run compared to physics-based NWP, and they keep improving at a rapid pace.
I would love to put together a guide for using these new models at some point. I build this kind of model for a private company, so I've been fully immersed in this emerging field for a few years.