Introduction

The reality of climate change is quickly becoming apparent1,2 as an increasing number of people are experiencing the various impacts of climate change through hazards, such as droughts and floods, which are generally expected to intensify in both magnitude and frequency3,4,5. The impacts of climate change-related hazards are wide-ranging and yet to be fully recognized6,7. Therefore, climate change represents an increasingly relevant topic in many academic fields and global society as a whole8. Adaptation to and mitigation of climate change and its impacts requires that policymakers, researchers, and engaged citizen stakeholders develop and maintain climate literacy in order to plan ahead and implement adaptations efficiently.

Climate literacy refers to the capacity to synthesize information regarding the climate within varying contexts9. Not only researchers and policymakers but also ordinary citizens can benefit from climate literacy. There are multiple reasons for this. First, with adequate climate literacy, individuals can discern the meaning and credibility (or lack of credibility) behind news articles. Moreover, individuals can sufficiently respond to both the economic and environmental ramifications of climate change and apply knowledge of climate change to their careers, as such knowledge impacts a vast array of fields10. In countries like the United States, citizens vote and pay taxes—many of which are relevant to sustainable policies and disaster response plans. If lacking climate literacy, individuals and the organizations and governments they make up may underestimate the urgency of adaptation measures to climate change, waiting to respond until it is too late to avoid the most damaging effects11.

The importance of climate literacy is underscored by younger generations, such as Generation Z, who comprise the future stakeholders that will formulate policies and actions, which will either exacerbate or mitigate the negative impacts of climate change around the globe12. Kuthe et al.12 identify teenagers as a target demographic of top priority since they will be the ones to take on the hazards of climate change, placing much of the future’s environmental conditions in their hands. Moser13 emphasizes this by identifying gaps in the public’s understanding of climate change and related issues. Clearly, climate literacy should be prioritized, especially focusing on younger individuals, for they represent the highest potential for implementing actions that will favorably impact the future14.

Considering the importance of enhancing climate literacy, the recent advances in generative artificial intelligence (AI), such as OpenAI’s ChatGPT and Google’s Bard, may hold meaningful implications for climate literacy. Such generative AI tools are expected to provide a more effective means to obtain new knowledge and information than conventional methods based on web search engines15,16,17, although it should be noted that these tools are not unanimously available. For example, Google’s Bard is not legal in China, nor available in Canada18. Additionally, younger generations comprise those who will be most affected by climate change, as well as those who will carry out any adaptation measures—effectively placing many of the practical outcomes of climate change in the hands of our youth and future generations19,20. This, combined with the growth of nontraditional learning platforms and tools21, has stimulated ongoing discussions among educators about the potential role of generative AI platforms in learning environments22,23,24. In other words, generative AI tools are expected to become essential tools for students and younger generations to improve their climate literacy.

With this growing need to improve the climate literacy of younger generations and the increasingly common use of generative AI platforms, we argue that researchers should examine the potential capabilities and weaknesses (e.g., inaccuracies and biases) of these tools15,16,17, particularly in the context of climate change topics25,26. Without acknowledging the weaknesses of specific AI tools, students may falsely believe such tools function without error, resulting in false information provided to students who integrate it into their understanding of climate change. This may eventually lead to severe educational problems, such as hallucination effects27,28. On the other hand, if generative AI platforms are shown to be sufficiently reliable, they may be used as an accessible means to enhance climate literacy. Our study takes an exploratory approach to this timely issue that, to the best of our knowledge, has yet to be addressed by previous studies.

We select OpenAI’s ChatGPT as our case study (using both GPT-3.5 and GPT-4). While many generative AI tools exist, with many more expected to be released in the near future, we choose to focus on ChatGPT for this case study for the following reasons: First, ChatGPT has experienced the most drastic acceleration of usage since its release29 and second, individuals aged 18-34 currently comprise over 60% of ChatGPT use30. Third, ChatGPT represents a prominent tool and shows an early adoption of usage in developing nations, for which many uses of AI can be identified31. While each specific AI tool should be examined with the same questions in mind, we focus here on ChatGPT as an initial exploration into the issue of climate change literacy and generative AI.

Overall, this study aims to examine the accuracy of ChatGPT’s responses to climate change-related hazards across the globe by comparing responses to credible hazard risk indices, which are based on data used in the International Panel on Climate Change (IPCC) Annual Review 6 (AR6), Working Group II, Chapter 832. We find overall agreement between ChatGPT responses and the hazard risk indices for floods and cyclones, but lower agreement regarding droughts, as well as improved consistency and reduced errors for GPT-4 responses (in comparison to GPT-3.5). This study offers an empirical attempt to systematically investigate the general capabilities and weaknesses of ChatGPT (as of December 2023) regarding country-level vulnerabilities to climate change-related hazards.

Results

Number of climate change-related topics by GPT-4

The topic counts per country from the first iteration of GPT-4 responses are displayed in Fig. 1(a), demonstrating the spatial variation across continents. On average, 9.089 topics are identified with a standard deviation of 1.129. The minimum value is 6, while the maximum value is 12. We further analyze the consistency in ChatGPT’s responses. Figure 1(b) shows the standard deviation of the number of topics per country over the 10 iterations. While the topic count variation remains fairly low for GPT-4 across each continent, many countries in Africa and some countries in the Middle East seem to have the least consistency, suggesting that ChatGPT’s responses are relatively less consistent in these regions compared to other regions.

Fig. 1: Topic count and accuracy results for GPT-4.
figure 1

a Spatial variation of topic counts across the first GPT-4 iteration, where topic count increases from light green to dark blue. b Map of the standard deviation of topic counts for GPT-4 across all ten iterations where standard deviation increases from light yellow to dark red. c Accuracies for droughts (orange triangles and line), cyclones (yellow circles and line), and floods (green squares and line) for GPT-4. Maps and graphs were created by authors.

Accuracy of GPT-4 results compared with the IPCC report

Overall, the first iteration of results created by GPT-4 proved fairly accurate compared to the validation data. Recall that three climate change-related hazard issues were selected for accuracy analysis—floods, droughts, and cyclones—because they create extensive, yet different, damage, thus needing to be monitored as climate change continues. Table 1 shows confusion matrices for each issue across one iteration.

Table 1 Confusion matrices for floods, droughts, and cyclones

Cyclone themes were the most accurate, with an accuracy score of 0.806. This means GPT-4 accurately identified cyclones as a climate change-related hazard 80.6% of the time. 20 false negatives and 17 false positives were produced for this theme. Flooding was accurately mentioned 76.4% of the time. False negatives and false positives for flooding were relatively of the same frequency, with counts of 20 and 25, respectively. There was no substantial difference between false positives and false negatives for floods and cyclones. However, while still having a reasonable accuracy score, droughts were the topic that GPT-4 struggled with most. Droughts were accurately identified 69.1% of the time, and there were 17 more false negatives than false positives.

Additionally, we examine how accuracy scores change for each hazard across 10 iterations, as seen in Fig. 1(c). Specifically, flood accuracy has an 8-percentage point difference with a range of 0.743-0.822, drought accuracy has a 5-percentage point difference with an accuracy range of 0.639-0.651, and cyclone accuracy has a 5-percentage point difference with an accuracy range from 0.770-0.822 across the 10 iterations. Overall, we conclude that accuracy scores for floods, droughts, and cyclones are consistent across all GPT-4 iterations.

Comparison between GPT3.5 and GPT-4 models

Overall, GPT-3.5 (default model, Fig. 2[a]) seems less reliable than GPT-4, which is the most advanced model. For instance, GPT-3.5 showed limited abilities to correctly produce responses in accordance with our prompt directions. As seen in Fig. 2(b), out of 1,910 prompt requests (i.e., 191 countries \(\times\) 10 iterations), 38 outputs from GPT-3.5 were in the incorrect format, which did not allow us to further process the outputs. Note that GPT-4 did not have issues providing the correct format for each of the 1,910 cases, suggesting that it is a more capable and reliable tool than GPT-3.5, at least in this regard.

Fig. 2: Topic count differences between models and general error maps.
figure 2

a Map showing the difference between topic count numbers (GPT-3.5 topic count – GPT-4 topic count). b Map showing the countries where output errors (dark red) occurred in GPT-3.5; light yellow indicates countries for which no error was observed in output responses. Maps were created by authors.

In light of this, assuming that GPT-4’s responses are reliable and accurate sources, Fig. 2b reports many errors, particularly for countries residing in Africa, with the general Europe-Asia boundary area being the second highest in errors. Perhaps the largest difference regarding topic counts between the two models can be identified in South America and Ireland (Fig. 2a). However, in general, all continents appear to have a similar distribution. Regarding the descriptive statistics of results obtained from GPT-3.5, the average number of topics identified by GPT-3.5’s responses is 9.426 with a standard deviation of 1.249. The minimum value of topic counts is 6, while the maximum value of topic counts is 15. The paired sample t-test results indicate that there is a statistically significant difference (p < 0.01) in the number of identified topics by GPT-3.5 and GPT-4.

Discussion

By focusing on ChatGPT as a case study, our exploratory study achieves one of the first steps toward informing users of generative AI tools’ potential strengths and weaknesses relevant to climate change literacy. By comparing three major hazards (floods, droughts, and cyclones) reported for each country by ChatGPT and comparing each to the validation data, we identified more accuracies than inaccuracies in ChatGPT’s responses—but not enough to conclude that the tool, when used in this way, is truly reliable. For example, ChatGPT tends to underestimate vulnerability to droughts, as ChatGPT reports droughts as a primary risk for considerably fewer countries than the trusted validation data do. This presents a false negative type error, which may potentially mislead ChatGPT’s users, who are currently formulating a sense of security and severity. For floods and cyclones, however, the opposite is true: most inaccuracies stem from false positives. Depending on the hazard, these trends in false positives/negatives present important biases and limitations that users should be aware of.

Despite the inaccuracies both types (false positive and false negative) clearly present, a considerable level of agreement is found between the ChatGPT responses and validation data for cyclones and floods. This is confirmed by the high accuracy scores across the 10 iterations of the GPT-4 model. However, the results also report a relatively lower level of agreement for droughts, as evidenced by lower accuracy scores than the other two hazard cases. Overall, our results suggest that, although the false positive bias should be kept in mind, ChatGPT may be used—with caution—as a starting point for users looking to gain climate literacy regarding some hazards, like floods and cyclones. However, considering droughts, more caution should be employed, as false negatives are arguably more dangerous in this context and overall accuracy is lower.

One should naturally ask what the origins of these inaccuracies might be. While identifying true causes is beyond the scope of this exploratory study, we suggest a few possible factors that may influence the performance of ChatGPT in this context. First, we must consider that this study was conducted entirely in English. As OpenAI has acknowledged, a bias toward English and perspectives aligning with Western cultures exists in the AI33. This bias may be relevant both to the responses generated by ChatGPT, which cater to Western, English-speaking users, as well as the AI’s processing of prompts—i.e., it may comprehend prompts from native English-speakers best. This situation is especially important to consider for regions within the Global South, where climate literacy is an important, yet poorly understood issue34. Perceptions of climate change risk vary widely across different cultures35, making even small semantic changes in ChatGPT responses potentially impactful. This language-related bias—in both ChatGPT functioning and user experience—introduces an additional variable to consider, the effects of which are not yet fully understood and may account for general variation in results, if this study were repeated in a non-English language. Additionally, regarding the lower accuracy for droughts (as compared to floods and cyclones), we must consider how such hazards are defined. The IPCC itself has acknowledged that drought is a relative term36, depending on many factors and contexts. Definitions in various sources other than the IPCC and related data sources may, therefore, vary more than other hazards like cyclones, which are more prominent and transparent in definitions (i.e., there is no debate over ongoing cyclones). This could partly explain the decreased accuracy in our validation of droughts, as opposed to floods and cyclones. Overall, this issue related to the definitions of hazards might contribute to the uncertainties of our analytical results, which future studies can examine through sensitivity analyses.

While not completely accurate compared to the validation data, GPT-4 offers a suggested pattern of consistency and reliability in its output regarding topic counts across 10 iterations. However, GPT-3.5 demonstrates unreliability as it produces errors when creating its responses, which we never encountered with GPT-4. Therefore, if possible, our results recommend that users employ GPT-4 rather than GPT-3.5. While it is unsurprising that GPT-4—the more advanced and costly version—performs better than the default version (GPT-3.5), this suggests potential ethical issues regarding tools available to users of different socioeconomic positions37,38,39. These potential ethical concerns can be especially relevant considering that those in developing economies have been some of the fastest populations to adopt applications of ChatGPT31.

We believe that a comprehensive examination of the capabilities of generative AI tools, such as ChatGPT and Bard, will likely grow in value, considering their quickly increasing role in climate literacy25,26 and their potential—yet debated—beneficial applications in the general education sector22,27,40 within countries where it is available. While providing insight into generative AI’s ability to summarize climate change-related hazards on a global, country-level scale, our study contains limitations that should not be overlooked. By utilizing the default parameters and API service that was initiated at each iteration, we provide data that, to the best of our knowledge, is minimally influenced by the user’s prompt history16,41. However, because of the black-box nature of AI models42, it must be noted that individual users may experience different outputs. Further, we recommend that future studies consult OpenAI’s documentation for relevant updates to either GPT version (since December 2023), as OpenAI regularly updates each model. Another variable to consider is that of user demand—might the performance of either version, particularly GPT-3.5, degrade with increased user demand at a given time? Next, we must also consider the limitations of the BERT NLP processing model with which we consolidated the ChatGPT responses into 50 themes. While the NLP model allows us to automate the consolidation process and reduce human error, and we employed the Davies-Bouldin Index, Silhouette, and Within-Cluster Sum of Squares scores (Supplementary Fig. 1), BERT is not perfect, and minor errors in clustering are possible, such as a group of temperature change topics including the more general topic of ‘arctic change.’ However, because BERT takes context into account, such an example may have related to temperature change in the original text. Regardless, this reminds us that BERT functions as a ‘black box’ model, which leaves us with unknowns that, for the time being, we simply accept. Keeping this in mind, we state that, within reasonable feasibility, the BERT model still offers improvements to this study’s approach in accuracy (eliminating human error) and efficiency (completing the same job manually would be nearly impossible, requiring contextual analysis of thousands of topics). Therefore, considering the high sample rate, we conclude that sparse random errors are acceptable for the scope of our study, especially in comparison to a manual approach. Thus, future studies are recommended to mitigate these uncertainties and limitations of the NLP model to provide a more robust theme consolidation result. Finally, the likely bias relating to the English language should be considered in additional cases43.

Further work should continue to comprehensively investigate the performance of the many additional emerging generative AI tools, such as Google’s Bard and ChatClimate (www.chatclimate.ai/)—a customized large language model developed by researchers26 for climate literacy-related use. Future studies are also recommended to quantify the limitations of these tools as precisely and comprehensively as possible. Potential geographic biases resulting from training datasets should also be examined more quantitatively16,44,45. One potential means to conduct further investigation into this issue would be to conduct a Delphi study46, which could offer insights before a wealth of established literature is available. Finally, developing educational recommendations for potential users of these AI tools is essential. More studies are being published, which indicate that prompt engineering and parameter-setting for GPT-4 are key for utilizing the tool effectively16. In light of this, we recommend further studies to examine the factors discussed here and develop best-practice guidelines. While most studies now focus on GPT-4 and its many additional capabilities, it is important to inform users of biases present in GPT-3.5, as many users, especially non-academic, will still use only the default version. This study puts forth an overview of country-level vulnerabilities to climate change-related hazards as told by both versions of ChatGPT as of December 2023.

In conclusion, climate change adaptation strategies will be dependent on the upcoming generations and their climate literacy—people’s understanding of climate change and willingness to be involved in mitigation and adaptation. This is a crucial point in the future of our planet, as projections show that waiting any longer to reduce climate emissions may result in a point of irreversible consequences47. Moreover, considering the growing importance of generative AI tools and their uptake by individuals worldwide, future studies on the combined topic of generative AI tools and climate literacy should commence with the ultimate goal of disseminating findings to enable informed, discerning use of ChatGPT and other increasingly popular generative AI platforms toward the pressing issue of climate change.

Methods

ChatGPT prompting

We performed analyses for both GPT-3.5, the default version, and GPT-4, the advanced version, to monitor any potential response differences between the two. Figure 3 illustrates an overview of the research methods. We formulated a prompt template (Supplementary Note 1) to inquire about a country’s vulnerability to climate change-related hazards. For example, we used the following prompt (similar to that in Kim et al.16) as our input to investigate Australia: “List the climate change-related hazards that Australia is most vulnerable to. Provide a numbered list of the climate change-related hazards with descriptions. Make sure to put a colon between the numbered list and the description. The listed climate change-related hazards should not be duplicated with each other.” This prompt was submitted for the 191 IPCC member countries. To examine to what extent ChatGPT’s responses are consistent in terms of different experiments, the prompt was repeated ten times for each country and each ChatGPT version. In total, 4,018 topics were created by both ChatGPT versions.

Fig. 3: Project workflow.
figure 3

Overview of the research methods. The top row and titles over each cell indicate the main components of our methods, while the respective columns provide more detailed steps and examples; the colors are only for distinguishing columns. The workflow diagram was created by authors.

To process these data effectively, we used OpenAI’s ChatGPT application programming interface (API)43; see Supplementary Note 2. Following an approach by Kim et al.16, we instructed the system to act as a helpful assistant and then began using our prompt. Notice that our prompt template instructs ChatGPT to report the hazards in a list form and place a colon after each topic name is introduced, thus allowing the API to extract all topics accurately and automatically from each response16. Regarding parameters that might affect ChatGPT’s outputs, we used all default settings, including that for temperature (randomness of responses) and max_tokens (response length). Responses per country took approximately 30 seconds to retrieve, on average.

Topic consolidation

To consolidate the topics (4018 once duplicates were removed) into similar topic clusters that are meaningful to be used for analysis, we employed the Bidirectional Encoder Representations from Transformers (BERT), a natural language processing (NLP) model48; see Supplementary Note 3. BERT is an open-source model that incorporates the context behind a word by comparing it to all other words in a sentence48,49,50. This capability allows it to efficiently identify recurring topics mentioned in responses from ChatGPT. K-means + + clustering51,52,53,54 was used to reduce the 4,018 unique topics into 50 topic clusters, which we refer to as themes. We identified 50 clusters as the optimal number of clusters by referring to the Davies-Bouldin Index, the within-cluster sum of squares (WCSS), and silhouette scores16; see Supplementary Fig. 1. From these results, we obtain basic descriptive statistics for GPT-3.5 and GPT-4. We also perform a paired sample t-test on topic counts to test if the difference between the identified topics of the two GPT versions is significant.

Comparison with validation data

To explore the accuracy and consistency of ChatGPT responses, we performed data validation by applying the Index for Risk Management (INFORM) Global Risk Index (GRI)55,56 to the ChatGPT results. The INFORM data set is based on the data within Chapter 8 of the most recent IPCC AR6 Working Group II31 and provides widely accepted, comprehensive measures of risk due to climate-related factors (including hazards), which are offered at the country-level scale. We use the 2019 version, which would, therefore, have been available to be used as training data for ChatGPT. With such reasons in mind, we chose these data for our validation process and hereafter refer to them as the validation data. We used this dataset to validate the ChatGPT responses for floods, droughts, and cyclones—three major climate change-related hazards that were included in both the validation data and ChatGPT response themes. The validation data consists of rankings from “very low” to “very high.” In order to compare these indices with our binary classification of the ChatGPT data, we translated the indices for “medium,” “high,” and “very high” as value 1 (i.e., presented), with anything below medium risk as value 0 (i.e., not presented).

We created confusion matrices between the validation data and ChatGPT responses for each validation hazard. The confusion matrices specify which hazard vulnerabilities the two sources (i.e., ChatGPT responses and validation data) agreed on (true positive [TP] and true negative [TN]) or disagreed on (false positive [FP] and false negative [FN]).

$${accuracy}=\frac{{TP}+{TN}}{{TP}+{TN}+{FP}+{FN}}$$
(1)

By combining the accuracy scores, and basic descriptive statistics for each iteration of responses from GPT-3.5 and GPT-4, we observe if there is a theme of general consistency between iterations and quantify the level of (dis)agreement between ChatGPT and the IPCC.