Your question of how to prove that the zillion pages of results are the desired and expected results is a good one.
35 years ago, I wrote an address correction program for a Baby Bell that hit a 98% accuracy rate building their marketing database from their customer master billing system database. From the 1.5 million customers on the first state's database, I randomly extracted about 5,000 addresses for my initial testing. And yes, I manually compared the 'before and after' results of all 5K addresses multiple times until I was satisfied that I had handled every one of them correctly. I had the luxury of writing the IMS customer database 'extract' to DB/2 marketing database 'loading/update' system, so I had access to all the data any time I needed it.

I started browsing the addresses at work and found such 'winners' as PO Street, Rural Lane, quadrant street addresses like Atlanta and Washington DC use, grid addresses such as N123W12345 Rt 144, and street names like 21W and W21. I naturally added some of those to my testbed and verified the results. I extracted perhaps 10K additional addresses and verified the results before I was ready to market the product to the Baby Bell for $110K, which they leased, etc. During full system testing, as well as after going into production, I periodically ran queries to browse the addresses within a manually chosen 'random' ZIP code to verify all was working as intended. The number of surprises (errors) dropped each time I did this, after I made the software changes needed to expect oddball addresses unique to some small towns, like 'Avenue A', or even 'Elm St #1' and 'Elm St #2' as two legitimate street names in one town. The #1 and #2 are NOT unit or apartment numbers! Fortunately, there weren't any apartment houses on those streets. 'Off Main' was another surprise street name that had to be dealt with, too.
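For the curious, a testbed of those 'winners' might look something like this. The parse_address below is only a toy stand-in for the real address correction logic, and the expected parses are illustrative, not USPS-verified:

```python
# A sketch of the 'oddball address' testbed described above.
# parse_address is a toy stand-in for the parser under test.

import re

def parse_address(raw: str) -> dict:
    """Toy parser: split a house/grid number off the street name.
    A real address correction program is far more involved."""
    m = re.match(r"^([NSEW]?\d+[A-Z]*\d*)\s+(.*)$", raw.strip())
    if m:
        return {"number": m.group(1), "street": m.group(2)}
    return {"number": "", "street": raw.strip()}

# Edge cases of the sort that only turn up by browsing real data.
TESTBED = [
    ("123 PO Street",     {"number": "123", "street": "PO Street"}),
    ("45 Rural Lane",     {"number": "45",  "street": "Rural Lane"}),
    ("N123W12345 Rt 144", {"number": "N123W12345", "street": "Rt 144"}),
    ("12 Avenue A",       {"number": "12",  "street": "Avenue A"}),
    # '#1' here is part of the street name, NOT a unit number!
    ("9 Elm St #1",       {"number": "9",   "street": "Elm St #1"}),
    ("77 Off Main",       {"number": "77",  "street": "Off Main"}),
]

for raw, expected in TESTBED:
    got = parse_address(raw)
    status = "ok  " if got == expected else "FAIL"
    print(f"{status} {raw!r} -> {got}")
```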
Where am I leading with this? With a potential of, say, 20 million possible city pairs and sequences of clicks, unless a full division of soldiers is available to manually check every possible result for accuracy, the only 'reasonable' solution is to start with a manageable number of city pairs, train numbers, passenger types, and numbers of passengers, and manually prove those are working. Granted, verifying that a trip from NYP to BAL shouldn't cost $500 but should be in the $77-150 range for Regional trains two weeks from now will be a very time-consuming manual task. Perhaps the best way to build the 'correct results' database would be for a group of people to manually perform the ticketing functions, get the 'current' (i.e., real Amtrak.com) results, and put those into the database as the 'valid' starting point. The trick would be to use the production Amtrak booking system with a separate test database of trains, numbers of already-booked passengers, etc., to get the various price adjustments (buckets) to show up. It would also be necessary to 'reset' the test booking database to the same 'date and time' prior to each test run; otherwise, some options may be sold out, prices increased, etc.
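To make that 'valid starting point' concrete, here's a rough sketch of what one record in the 'correct results' database might look like. All the field names are mine, not Amtrak's, and the fare range is just one way to tolerate bucket drift between capture and replay:

```python
# A sketch of the 'correct results' database idea: each record pairs one
# set of screen inputs with the manually captured Amtrak.com result.

from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenFare:
    origin: str          # e.g. "NYP"
    destination: str     # e.g. "BAL"
    travel_date: str     # date the fare was quoted for
    train: str           # e.g. "Regional 85"
    passenger_type: str  # "adult", "senior", ...
    passengers: int
    fare_low: float      # an acceptable range rather than one number,
    fare_high: float     # since buckets shift between capture and replay

def check(golden: GoldenFare, quoted_fare: float) -> bool:
    """Pass if the system-under-test quote falls in the captured range."""
    return golden.fare_low <= quoted_fare <= golden.fare_high

baseline = GoldenFare("NYP", "BAL", "2019-07-15", "Regional 85",
                      "adult", 1, 77.0, 150.0)
print(check(baseline, 98.0))   # True
print(check(baseline, 500.0))  # False -- the $500 surprise
```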
Which brings me to variable results for the same screen-input choices. Obviously, starting with a 'currently booked' database for 7/1/19 at 12:01 AM and forcing the current 'clock' to that time is a requirement for test scripts with thousands of possible choices. That way, the test could be 'forced' to 'keep buying' a coach seat on train <whatever> from NYP-BAL until the fare jumps to the next bucket price, and even on to a sold-out status (a sketch of that loop follows below). It's also necessary to feed in data that is invalid or unreasonable, such as trying to book tickets for a group of 500 on a single train. Each of the data field edits must be proven to be functioning correctly as well. Back in the punched-card days, it was commonplace to reach into the trash can, grab a bunch of cards, and run them in as 'update data' or 'billing system input' to conclusively prove the edits were kicking out the bad data and letting the good data in.

Getting back to that Baby Bell: some of their data field edits for customer screen input to the billing system were as simple as 'not blank' for a person's first name, last name, middle initial, address, and even city name! The address correction program found over 130 ways to spell Chicago on the IMS database! Some of the city name abbreviations and misspellings were downright hilarious. GIGO (garbage in, garbage out) reigned!

When I told the VP in charge of the customer database and billing system that I could easily run an update to correct the IMS database, I was immediately told in no uncertain terms where to shove my idea and where I was to go. Several years later, while driving in Seattle, I passed a 'head shop' with a bumper sticker that 'killed' me. It had a Bell System company logo and, adjacent to it, the words: "We don't care. We don't have to!" How true it was. Sometimes I think one should be made for Amtrak and other big companies, too.
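Here's that 'keep buying until the bucket jumps' loop as a minimal sketch. It assumes the frozen test clock and resettable test booking database described above; book_coach_seat and its fake fare buckets are stand-ins I made up for illustration:

```python
# Minimal sketch: rebook one coach seat until the fare bucket jumps,
# then until the train sells out, against a resettable test database.

import itertools

FROZEN_CLOCK = "2019-07-01T00:01:00"   # forced system time for every run

def book_coach_seat(train, origin, dest):
    """Stand-in for one booking against the test database.
    Returns (fare, sold_out). Here we fake three buckets, then sell out."""
    book_coach_seat.sold = getattr(book_coach_seat, "sold", 0) + 1
    if book_coach_seat.sold > 30:
        return None, True
    buckets = [49.0, 77.0, 110.0]                 # fake fare buckets
    return buckets[min(book_coach_seat.sold // 10, 2)], False

last_fare = None
for n in itertools.count(1):
    fare, sold_out = book_coach_seat("123", "NYP", "BAL")
    if sold_out:
        print(f"sold out after {n - 1} seats")
        break
    if fare != last_fare:
        print(f"seat {n}: bucket jumped to ${fare:.2f}")
        last_fare = fare
```

In a real harness, book_coach_seat would hit the test booking system, and the assertion would be that the bucket jumps happen at the seat counts the fare rules say they should.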
So how does one write a script for entering XYZ instead of the number of adult passengers, or entering YYZ (Toronto's airport code, if memory serves) as a destination station? How does one 'force' a sold-out condition other than by 'purchasing' some big number of tickets? I would suggest these have to be manually written scripts that get added to the zillion other computer-generated ones.
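Those hand-written negative scripts could be as simple as a short table of bad inputs that are all expected to be rejected. submit_booking here is a hypothetical entry point standing in for whatever screen-input edits are under test:

```python
# A sketch of manually written negative tests to sit alongside the
# generated scripts: each of these inputs must be kicked out.

NEGATIVE_CASES = [
    {"adults": "XYZ", "origin": "NYP", "dest": "BAL"},  # letters, not a count
    {"adults": 1,     "origin": "NYP", "dest": "YYZ"},  # airport code, not a station
    {"adults": 500,   "origin": "NYP", "dest": "BAL"},  # absurd group size
]

def submit_booking(request: dict):
    """Stand-in for the field edits under test; raises on bad input."""
    adults = request["adults"]
    if not isinstance(adults, int) or not 1 <= adults <= 8:
        raise ValueError(f"bad passenger count: {adults!r}")
    if request["dest"] not in {"NYP", "BAL", "WAS"}:    # toy station list
        raise ValueError(f"unknown station: {request['dest']}")
    return "booked"

for case in NEGATIVE_CASES:
    try:
        submit_booking(case)
        print(f"FAIL: bad data got in: {case}")
    except ValueError as err:
        print(f"ok: rejected ({err})")
```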
In short, I think it would be unreasonable to manually verify the 'initial' run of the automated testing driver program due to the millions of possibilities. However, starting with perhaps 2-3K possibilities and verifying those, then adding more, say 500 at a time, and verifying each addition (making coding adjustments as needed) would be reasonable, until perhaps a 'representative sampling' of 20-50K scripts was achieved. I forget what they called that mathematical 'rule' — proof by induction, I believe: if it works for N, and working for N implies it works for N+1, then it works all the way up. Of course, Amtrak ticketing isn't a linear process; it's a giant tree-branching system. Then throw in simultaneous servers around the USA processing hundreds, if not thousands, of ticketing requests per minute, all trying to 'grab' the last ticket on train 123 on the 4th, and the 'fun' gets 'verrrry eeenteresting'. But then, 'server wars' are not part of the automated testing problem to be solved.
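Coming back to the ramp-up idea, though: as a sketch, the 'start with 3K, grow by 500, only grow when clean' loop might look like this. Every name here is hypothetical, and the toy demo just shows the mechanics:

```python
# Sketch of the incremental ramp-up: grow the script set in steps of 500,
# rerunning everything against the golden-results database each time, and
# only grow again once the current batch verifies clean.

import itertools

def ramp_up(sample_scripts, run_script, golden, start=3000, step=500, cap=50000):
    scripts = sample_scripts(start)
    while True:
        failures = [s for s in scripts if run_script(s) != golden.get(s)]
        if failures:
            return scripts, failures   # stop and fix before growing further
        if len(scripts) >= cap:
            return scripts, []         # representative sample reached
        scripts += sample_scripts(step)

# Toy demo: scripts are just integers, the 'system' doubles them, and the
# golden database holds the manually verified doublings.
golden = {i: i * 2 for i in range(60000)}
pool = itertools.count()
scripts, failures = ramp_up(
    sample_scripts=lambda n: [next(pool) for _ in range(n)],
    run_script=lambda s: s * 2,
    golden=golden,
)
print(f"{len(scripts)} scripts verified, {len(failures)} failures")
```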