超參數優化

機器學習中，超參數優化^[1]或整定（tuning）是為學習算法選擇一組最佳超參數的問題。超參數是用於控制學習過程的參數。

超參數優化會找到能產生最優模型的超參數元組，在給定的獨立數據上將預定義的損失函數最小化。^[2]目標函數獲取超參數元組，返回相關損失。^[2]交叉驗證常用於估算這種泛化性能，從而為超參數選擇一組能使其最大化的值。^[3]

方法

網格搜索

超參數優化的傳統方法是網格搜索（grid search）或參數掃描（parameter sweep），即對學習算法超參數空間中人工指定的子集暴力搜索。網格搜索算法必須以某些性能指標為指導，通常是在訓練集上交叉驗證^[4]或在保持驗證集上評估。^[5]

由於機器學習的參數空間可能包括參數的實值空間或無界值空間，因此在網格搜索前可能要手動設置邊界與離散化。

例如，配備RBF核的典型軟邊界SVM 分類器至少有兩個超參數需要調整，以便在未見數據上獲得良好性能：正則化常數C、核超參數γ。它們都是連續的，因此網格搜索要為每個參數選擇一組有限的「合理」值，如

C\in \{10,100,1000\}

\gamma \in \{0.1,0.2,0.5,1.0\}

然後，網格搜索會以兩個集合的笛卡爾積的每對 $(C,\ \gamma )$ 訓練SVM，並在不變的驗證集上評估性能（或在訓練集上進行內部交叉驗證，這時每對集合會訓練多個SVM）。最後，算法會輸出驗證中得分最高的超參數。

網格搜索會受到維數災難影響，但由於評估的超參數設置通常相互獨立，是過易並行的。^[3]

隨機搜索

隨機搜索以隨機選擇取代所有組合的暴力搜索。這可以簡單地應用於上述離散空間，也可推廣到連續空間和混合空間。對連續超參數，隨機搜索比網格搜索能探索更多的值，^[3]這時優化問題具有較低的內蘊維度。^[6]隨機搜索也是過易並行的，此外還允許指定採樣分佈以納入先驗知識。隨機搜索雖簡單，但仍是較新的超參數優化方法性能的重要基準。

貝葉斯優化

貝葉斯優化是一種針對噪聲黑盒函數的全局優化方法。貝葉斯優化用於超參數優化時建立超參數值到在驗證集上目標函數值的函數的概率模型。貝葉斯優化是要據當前模型，迭代地評估較好的超參數配置、再更新，收集儘可能多的觀察結果，揭示有關該函數的信息，尤其是最佳值的位置。它試圖在探索（結果最不確定的超參數）和利用（預期接近最優的超參數）之間取得平衡。實踐中，貝葉斯優化同前兩種算法相比^[7]^[8]^[9]^[10]，由於能在實驗前就推理實驗質量，因此能以更少的評估次數獲得更好的結果。

基於梯度的優化

對特定的學習算法，可以計算超參數的梯度，再用梯度下降法優化超參數。這類技術的首次使用主要在神經網絡。^[11]此後，到其他模型也有推廣，如支持向量機^[12]或邏輯回歸。^[13]

獲得超參數梯度的另一種方法是用自動微分，微分迭代優化算法的步驟。^[14]^[15]^[16]^[17]沿這種思路，最近的一項研究利用隱函數定理計算超梯度，提出了一種穩定的逆黑塞近似法，可推廣到數百萬個超參數，需要恆量的內存。^[18]

另一種方法是^[19]訓練超網絡以逼近最佳響應函數，這種方法也能處理離散超參數。自整定網絡^[20]為超網絡選擇一種緊表示，提供了一種內存效率更高的版本。最近，Δ-STN^[21]對超網絡進行輕微的重參數化，加速了訓練。Δ-STN還通過在權值中線性化網絡，從而消除權值大幅變化帶來的非線性影響，從而更好逼近最佳響應雅可比。

超網絡之外，梯度方法也可用於優化離散超參數，如對參數進行連續鬆弛。^[22]這類方法已被廣泛用於神經結構搜索中的結構超參數優化。

演化優化

演化優化是一種對噪聲黑盒函數進行全局優化的方法。^[8]超參數優化中，演化優化使用進化算法搜索給定算法的超參數空間，遵循受演化啟發的過程：

創建隨機解的初始種群（隨機生成超參數元組，通常超過100個）
取值，並計算其適值函數（如使用這些超參數的機器學習算法的10折交叉驗證準確率）
按相對適值排序超參數元組
用由交叉與變異產生的新超參數元組取代之前表現較差的
重複2-4，直到達到令人滿意的算法性能，或無法再提高為止

演化優化已被用於統計機器學習算法、^[8]自動機器學習、典型神經網絡、^[23]深度神經網絡結構搜索中的超參數優化^[24]^[25]，以及深度神經網絡中的權重訓練。^[26]

基於種群

種群訓練（Population-Based Training，PBT）同時學習超參數值和網絡權重。使用不同超參數的學習過程相互獨立進行。與演化算法類似，性能更差的模型會被迭代地替換為從性能較好的模型修改來的超參與權重的模型。替換模型熱啟動是PBT與其他演化算法的主要區別。因此，PBT允許超參數演化，無需手動調超參。此過程對模型結構、損失函數或訓練過程不做任何假設。 PBT及其變體是自適應方法：在模型訓練時更新超參。非自適應法的次優策略是在整個訓練過程中分配一組恆定的超參。^[27]

基於提前停止

8個任意超參配置進行連續減半。該方法從具有不同配置的8個模型開始，連續應用連續減半，直到只剩一個模型。

一類基於提前停止的超參優化算法專為大型超參數搜索空間設計，尤其是評估一組超參數的計算成本很高時。Irace實現了迭代競賽算法（iterated racing algorithm），將搜索重點放在最有前景的配置上，用統計測試剔除不佳的。^[28]^[29] 另一種提前停止超參數優化算法是連續減半算法（SHA），^[30]一開始是隨機搜索，但會定期修剪低性能模型，從而將計算資源集中到更有前景的模型上。異步連續減（ASHA）^[31]無需同步評估與修剪，從而進一步提高了SHA的資源利用率。Hyperband^[32]是一種更高級的基於提前停止的算法，可多次調用SHA或ASHA，具有不同程度的剪枝侵佔性（aggressiveness），因此適用範圍更廣，所需輸入也更少。

其他

RBF^[33]與譜方法^[34]也有開發。

超參數優化的問題

超參數優化結束後，通常會在訓練集上擬合一組超參數，然後根據在驗證集的泛化性能或得分來選擇。但這種方法可能使驗證集的超參數過擬合，因此驗證集（在交叉驗證中可以是多個）的泛化性能得分不能同時用於估算最終模型的泛化性能。為此，必須在獨立於優化超參的集合（且兩兩不交）上評估，否則性能值可能畸大。這可在第二個測試集上進行，也可由稱作嵌套交叉驗證的外部交叉驗證，對模型泛化性能進行無偏估計，同時考慮到超參數優化帶來的偏差。

另見

參考文獻

^ Matthias Feurer and Frank Hutter. Hyperparameter optimization. In: AutoML: Methods, Systems, Challenges, pages 3–38.
^ ^2.0 ^2.1 Claesen, Marc; Bart De Moor. Hyperparameter Search in Machine Learning. 2015. arXiv:1502.02127  [cs.LG].
^ ^3.0 ^3.1 ^3.2 Bergstra, James; Bengio, Yoshua. Random Search for Hyper-Parameter Optimization (PDF). Journal of Machine Learning Research. 2012, 13: 281–305.
^ Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). A practical guide to support vector classification. Technical Report, National Taiwan University.
^ Chicco D. Ten quick tips for machine learning in computational biology. BioData Mining. December 2017, 10 (35): 35. PMC 5721660 . PMID 29234465. doi:10.1186/s13040-017-0155-3 .
^ Ziyu, Wang; Frank, Hutter; Masrour, Zoghi; David, Matheson; Nando, de Feitas. Bayesian Optimization in a Billion Dimensions via Random Embeddings. Journal of Artificial Intelligence Research. 2016, 55: 361–387. S2CID 279236. arXiv:1301.1942 . doi:10.1613/jair.4806 （英語）.
^ Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin, Sequential Model-Based Optimization for General Algorithm Configuration, Learning and Intelligent Optimization (PDF), Lecture Notes in Computer Science 6683: 507–523, 2011, CiteSeerX 10.1.1.307.8813 , ISBN 978-3-642-25565-6, S2CID 6944647, doi:10.1007/978-3-642-25566-3_40
^ ^8.0 ^8.1 ^8.2 Bergstra, James; Bardenet, Remi; Bengio, Yoshua; Kegl, Balazs, Algorithms for hyper-parameter optimization (PDF), Advances in Neural Information Processing Systems, 2011
^ Snoek, Jasper; Larochelle, Hugo; Adams, Ryan. Practical Bayesian Optimization of Machine Learning Algorithms (PDF). Advances in Neural Information Processing Systems. 2012. Bibcode:2012arXiv1206.2944S. arXiv:1206.2944 .
^ Thornton, Chris; Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms (PDF). Knowledge Discovery and Data Mining. 2013. Bibcode:2012arXiv1208.3719T. arXiv:1208.3719 .
^ Larsen, Jan; Hansen, Lars Kai; Svarer, Claus; Ohlsson, M. Design and regularization of neural networks: The optimal use of a validation set (PDF). Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop. 1996: 62–71. CiteSeerX 10.1.1.415.3266 . ISBN 0-7803-3550-3. S2CID 238874. doi:10.1109/NNSP.1996.548336.
^ Olivier Chapelle; Vladimir Vapnik; Olivier Bousquet; Sayan Mukherjee. Choosing multiple parameters for support vector machines (PDF). Machine Learning. 2002, 46: 131–159. doi:10.1023/a:1012450327387 .
^ Chuong B; Chuan-Sheng Foo; Andrew Y Ng. Efficient multiple hyperparameter learning for log-linear models (PDF). Advances in Neural Information Processing Systems. 2008, 20.
^ Domke, Justin. Generic Methods for Optimization-Based Modeling (PDF). Aistats. 2012, 22 [2017-12-09]. （原始內容 (PDF)存檔於2014-01-24）.
^ Maclaurin, Dougal; Duvenaud, David; Adams, Ryan P. Gradient-based Hyperparameter Optimization through Reversible Learning. 2015. arXiv:1502.03492  [stat.ML].
^ Franceschi, Luca; Donini, Michele; Frasconi, Paolo; Pontil, Massimiliano. Forward and Reverse Gradient-Based Hyperparameter Optimization (PDF). Proceedings of the 34th International Conference on Machine Learning. 2017. Bibcode:2017arXiv170301785F. arXiv:1703.01785 .
^ Shaban, A., Cheng, C. A., Hatch, N., & Boots, B. (2019, April). Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1723-1732). PMLR.
^ Lorraine, J., Vicol, P., & Duvenaud, D. (2018). Optimizing Millions of Hyperparameters by Implicit Differentiation. arXiv preprint arXiv:1911.02590.
^ Lorraine, J., & Duvenaud, D. (2018). Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419.
^ MacKay, M., Vicol, P., Lorraine, J., Duvenaud, D., & Grosse, R. (2019). Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. arXiv preprint arXiv:1903.03088.
^ Bae, J., & Grosse, R. B. (2020). Delta-stn: Efficient bilevel optimization for neural networks using structured response jacobians. Advances in Neural Information Processing Systems, 33, 21725-21737.
^ Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
^ Kousiouris G, Cuccinotta T, Varvarigou T. The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks. Journal of Systems and Software. 2011, 84 (8): 1270–1291. doi:10.1016/j.jss.2011.04.013. hdl:11382/361472 .
^ Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B. Evolving Deep Neural Networks. 2017. arXiv:1703.00548  [cs.NE].
^ Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K. Population Based Training of Neural Networks. 2017. arXiv:1711.09846  [cs.LG].
^ Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J. Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning. 2017. arXiv:1712.06567  [cs.NE].
^ Li, Ang; Spyra, Ola; Perel, Sagi; Dalibard, Valentin; Jaderberg, Max; Gu, Chenjie; Budden, David; Harley, Tim; Gupta, Pramod. A Generalized Framework for Population Based Training. 2019-02-05. arXiv:1902.01894  [cs.AI].
^ López-Ibáñez, Manuel; Dubois-Lacoste, Jérémie; Pérez Cáceres, Leslie; Stützle, Thomas; Birattari, Mauro. The irace package: Iterated Racing for Automatic Algorithm Configuration. Operations Research Perspective. 2016, 3 (3): 43–58. doi:10.1016/j.orp.2016.09.002 . hdl:10419/178265 .
^ Birattari, Mauro; Stützle, Thomas; Paquete, Luis; Varrentrapp, Klaus. A Racing Algorithm for Configuring Metaheuristics. Gecco 2002. 2002: 11–18.
^ Jamieson, Kevin; Talwalkar, Ameet. Non-stochastic Best Arm Identification and Hyperparameter Optimization. 2015-02-27. arXiv:1502.07943  [cs.LG].
^ Li, Liam; Jamieson, Kevin; Rostamizadeh, Afshin; Gonina, Ekaterina; Hardt, Moritz; Recht, Benjamin; Talwalkar, Ameet. A System for Massively Parallel Hyperparameter Tuning. 2020-03-16. arXiv:1810.05934v5  [cs.LG].
^ Li, Lisha; Jamieson, Kevin; DeSalvo, Giulia; Rostamizadeh, Afshin; Talwalkar, Ameet. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research. 2020-03-16, 18: 1–52. arXiv:1603.06560 .
^ Diaz, Gonzalo; Fokoue, Achille; Nannicini, Giacomo; Samulowitz, Horst. An effective algorithm for hyperparameter optimization of neural networks. 2017. arXiv:1705.08520  [cs.AI].
^ Hazan, Elad; Klivans, Adam; Yuan, Yang. Hyperparameter Optimization: A Spectral Approach. 2017. arXiv:1706.00764  [cs.LG].

[1] Matthias Feurer and Frank Hutter. Hyperparameter optimization. In: AutoML: Methods, Systems, Challenges, pages 3–38.

[abs1502.02127-2] 2.0 ^2.1 Claesen, Marc; Bart De Moor. Hyperparameter Search in Machine Learning. 2015. arXiv:1502.02127  [cs.LG].

[bergstra-3] 3.0 ^3.1 ^3.2 Bergstra, James; Bengio, Yoshua. Random Search for Hyper-Parameter Optimization (PDF). Journal of Machine Learning Research. 2012, 13: 281–305.

[4] Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). A practical guide to support vector classification. Technical Report, National Taiwan University.

[5] Chicco D. Ten quick tips for machine learning in computational biology. BioData Mining. December 2017, 10 (35): 35. PMC 5721660 . PMID 29234465. doi:10.1186/s13040-017-0155-3 .

[6] Ziyu, Wang; Frank, Hutter; Masrour, Zoghi; David, Matheson; Nando, de Feitas. Bayesian Optimization in a Billion Dimensions via Random Embeddings. Journal of Artificial Intelligence Research. 2016, 55: 361–387. S2CID 279236. arXiv:1301.1942 . doi:10.1613/jair.4806 （英語）.

[hutter-7] Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin, Sequential Model-Based Optimization for General Algorithm Configuration, Learning and Intelligent Optimization (PDF), Lecture Notes in Computer Science 6683: 507–523, 2011, CiteSeerX 10.1.1.307.8813 , ISBN 978-3-642-25565-6, S2CID 6944647, doi:10.1007/978-3-642-25566-3_40

[bergstra11-8] 8.0 ^8.1 ^8.2 Bergstra, James; Bardenet, Remi; Bengio, Yoshua; Kegl, Balazs, Algorithms for hyper-parameter optimization (PDF), Advances in Neural Information Processing Systems, 2011

[snoek-9] Snoek, Jasper; Larochelle, Hugo; Adams, Ryan. Practical Bayesian Optimization of Machine Learning Algorithms (PDF). Advances in Neural Information Processing Systems. 2012. Bibcode:2012arXiv1206.2944S. arXiv:1206.2944 .

[thornton-10] Thornton, Chris; Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms (PDF). Knowledge Discovery and Data Mining. 2013. Bibcode:2012arXiv1208.3719T. arXiv:1208.3719 .

[11] Larsen, Jan; Hansen, Lars Kai; Svarer, Claus; Ohlsson, M. Design and regularization of neural networks: The optimal use of a validation set (PDF). Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop. 1996: 62–71. CiteSeerX 10.1.1.415.3266 . ISBN 0-7803-3550-3. S2CID 238874. doi:10.1109/NNSP.1996.548336.

[12] Olivier Chapelle; Vladimir Vapnik; Olivier Bousquet; Sayan Mukherjee. Choosing multiple parameters for support vector machines (PDF). Machine Learning. 2002, 46: 131–159. doi:10.1023/a:1012450327387 .

[13] Chuong B; Chuan-Sheng Foo; Andrew Y Ng. Efficient multiple hyperparameter learning for log-linear models (PDF). Advances in Neural Information Processing Systems. 2008, 20.

[14] Domke, Justin. Generic Methods for Optimization-Based Modeling (PDF). Aistats. 2012, 22 [2017-12-09]. （原始內容 (PDF)存檔於2014-01-24）.

[abs1502.03492-15] Maclaurin, Dougal; Duvenaud, David; Adams, Ryan P. Gradient-based Hyperparameter Optimization through Reversible Learning. 2015. arXiv:1502.03492  [stat.ML].

[16] Franceschi, Luca; Donini, Michele; Frasconi, Paolo; Pontil, Massimiliano. Forward and Reverse Gradient-Based Hyperparameter Optimization (PDF). Proceedings of the 34th International Conference on Machine Learning. 2017. Bibcode:2017arXiv170301785F. arXiv:1703.01785 .

[17] Shaban, A., Cheng, C. A., Hatch, N., & Boots, B. (2019, April). Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1723-1732). PMLR.

[18] Lorraine, J., Vicol, P., & Duvenaud, D. (2018). Optimizing Millions of Hyperparameters by Implicit Differentiation. arXiv preprint arXiv:1911.02590.

[19] Lorraine, J., & Duvenaud, D. (2018). Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419.

[20] MacKay, M., Vicol, P., Lorraine, J., Duvenaud, D., & Grosse, R. (2019). Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. arXiv preprint arXiv:1903.03088.

[21] Bae, J., & Grosse, R. B. (2020). Delta-stn: Efficient bilevel optimization for neural networks using structured response jacobians. Advances in Neural Information Processing Systems, 33, 21725-21737.

[22] Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

[kousiouris1-23] Kousiouris G, Cuccinotta T, Varvarigou T. The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks. Journal of Systems and Software. 2011, 84 (8): 1270–1291. doi:10.1016/j.jss.2011.04.013. hdl:11382/361472 .

[miikkulainen1-24] Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B. Evolving Deep Neural Networks. 2017. arXiv:1703.00548  [cs.NE].

[jaderberg1-25] Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K. Population Based Training of Neural Networks. 2017. arXiv:1711.09846  [cs.LG].

[such1-26] Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J. Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning. 2017. arXiv:1712.06567  [cs.NE].

[27] Li, Ang; Spyra, Ola; Perel, Sagi; Dalibard, Valentin; Jaderberg, Max; Gu, Chenjie; Budden, David; Harley, Tim; Gupta, Pramod. A Generalized Framework for Population Based Training. 2019-02-05. arXiv:1902.01894  [cs.AI].

[irace-28] López-Ibáñez, Manuel; Dubois-Lacoste, Jérémie; Pérez Cáceres, Leslie; Stützle, Thomas; Birattari, Mauro. The irace package: Iterated Racing for Automatic Algorithm Configuration. Operations Research Perspective. 2016, 3 (3): 43–58. doi:10.1016/j.orp.2016.09.002 . hdl:10419/178265 .

[race-29] Birattari, Mauro; Stützle, Thomas; Paquete, Luis; Varrentrapp, Klaus. A Racing Algorithm for Configuring Metaheuristics. Gecco 2002. 2002: 11–18.

[30] Jamieson, Kevin; Talwalkar, Ameet. Non-stochastic Best Arm Identification and Hyperparameter Optimization. 2015-02-27. arXiv:1502.07943  [cs.LG].

[31] Li, Liam; Jamieson, Kevin; Rostamizadeh, Afshin; Gonina, Ekaterina; Hardt, Moritz; Recht, Benjamin; Talwalkar, Ameet. A System for Massively Parallel Hyperparameter Tuning. 2020-03-16. arXiv:1810.05934v5  [cs.LG].

[32] Li, Lisha; Jamieson, Kevin; DeSalvo, Giulia; Rostamizadeh, Afshin; Talwalkar, Ameet. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research. 2020-03-16, 18: 1–52. arXiv:1603.06560 .

[abs1705.08520-33] Diaz, Gonzalo; Fokoue, Achille; Nannicini, Giacomo; Samulowitz, Horst. An effective algorithm for hyperparameter optimization of neural networks. 2017. arXiv:1705.08520  [cs.AI].

[abs1706.00764-34] Hazan, Elad; Klivans, Adam; Yuan, Yang. Hyperparameter Optimization: A Spectral Approach. 2017. arXiv:1706.00764  [cs.LG].

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]